20  Zigbee Routing and Self-Healing

AODV routing protocol, path discovery, and automatic mesh recovery

In 60 Seconds

Zigbee uses AODV (Ad-hoc On-Demand Distance Vector) routing to discover paths through the mesh network. When a device needs to send data, it broadcasts a Route Request (RREQ); intermediate routers forward it until the destination replies with a Route Reply (RREP), establishing the path. If a link breaks, the mesh self-heals by triggering a new route discovery. This on-demand approach conserves bandwidth (no periodic routing updates) but adds latency for the first message to a new destination.

20.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Explain how AODV (Ad-hoc On-Demand Distance Vector) routing discovers paths on demand in Zigbee networks
  • Trace the route discovery process through RREQ broadcast, RREP unicast, and routing table creation
  • Analyse how Zigbee mesh networks detect link failures and self-heal through RERR propagation and route rediscovery
  • Calculate expected routing latency for multi-hop paths including route discovery overhead
  • Design networks with adequate path redundancy to ensure reliable self-healing under router failure

20.2 Introduction

Zigbee’s mesh networking capability relies on sophisticated routing protocols to deliver messages across multiple hops. The primary routing mechanism is AODV (Ad-hoc On-Demand Distance Vector), which discovers routes only when needed, saving memory and reducing overhead on resource-constrained devices.

Imagine you need to send a package across the country, but there’s no direct route. Instead, the package travels through multiple cities:

Your city → City A → City B → City C → Destination

Each city is like a Zigbee Router, and the package is your message. Routing protocols figure out:

  1. Which cities (routers) to use
  2. What to do if a city (router) is unavailable
  3. How to find the best path

AODV is the “GPS” of Zigbee - it finds routes when you need them.

20.3 AODV Routing Protocol

AODV (Ad-hoc On-Demand Distance Vector) is a reactive routing protocol - it discovers routes only when traffic needs to flow, rather than maintaining routes to all destinations proactively.

20.3.1 Why On-Demand Routing?

| Approach | Memory Usage | Network Traffic | Route Freshness |
|----------|--------------|-----------------|-----------------|
| Proactive | High (all routes) | High (periodic updates) | Always fresh |
| Reactive (AODV) | Low (active routes only) | Low (on-demand) | Fresh when used |

For resource-constrained Zigbee devices with 8-32KB RAM, reactive routing is essential.

20.3.2 Route Discovery Process

When a device needs to send a message to a destination without a known route:

Figure 20.1: AODV route discovery showing RREQ broadcast, RREP unicast, and subsequent data transmission

20.3.3 RREQ (Route Request)

When a device needs a route, it broadcasts an RREQ:

RREQ Message Contents:
- Source Address: 0x0023 (the sensor)
- Destination Address: 0x0000 (the coordinator)
- RREQ ID: 42 (unique per source, used with Source Address to detect duplicates)
- Hop Count: 0 (incremented at each hop)
- Radius: 5 (maximum hops allowed, called TTL in generic AODV)

RREQ Propagation:

  1. Source broadcasts RREQ to all neighbors
  2. Each Router that receives RREQ:
    • Checks if it is the destination → If yes, send RREP
    • Checks if already seen this RREQ (by Source Address + RREQ ID pair) → If yes, drop
    • Otherwise, increment hop count and rebroadcast
  3. RREQ floods through network until destination is reached
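
The propagation rules above can be sketched as a breadth-first flood. This is a minimal illustration, not the Zigbee stack's implementation: the topology, the function name `flood_rreq`, and the assumption that every hop takes the same time are all hypothetical, and a real network chooses between paths by link cost rather than arrival order alone.

```python
from collections import deque

def flood_rreq(neighbors, source, destination, radius):
    """Simulate an RREQ flood with duplicate suppression and a hop limit.

    neighbors: dict mapping a node address to the addresses it can hear.
    Returns (reverse_next_hop, hops). hops is None when the destination
    cannot be reached within `radius` hops.
    """
    seen = {source}         # nodes that already handled this (source, RREQ ID) pair
    reverse_next_hop = {}   # node -> neighbor it first heard the RREQ from
    queue = deque([(source, 0)])
    while queue:
        node, hop_count = queue.popleft()
        if node == destination:
            return reverse_next_hop, hop_count  # destination would now send an RREP
        if hop_count == radius:
            continue                            # radius exhausted: do not rebroadcast
        for nbr in neighbors[node]:
            if nbr in seen:
                continue                        # duplicate RREQ: drop
            seen.add(nbr)
            reverse_next_hop[nbr] = node
            queue.append((nbr, hop_count + 1))
    return reverse_next_hop, None

# Hypothetical topology: sensor 0x0023 reaches the coordinator 0x0000 through
# router 0x0001, or through a longer detour via routers 0x0002 and 0x0003.
TOPOLOGY = {
    0x0023: [0x0001, 0x0002],
    0x0001: [0x0023, 0x0000],
    0x0002: [0x0023, 0x0003],
    0x0003: [0x0002, 0x0000],
    0x0000: [0x0001, 0x0003],
}

reverse, hops = flood_rreq(TOPOLOGY, source=0x0023, destination=0x0000, radius=5)
print(hops)                   # 2
print(hex(reverse[0x0000]))   # 0x1
```

With `radius=1` the same call returns `None` for the hop count, because the RREQ is dropped before it reaches the coordinator.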

20.3.4 RREP (Route Reply)

When the destination (or a router with a fresh route) receives the RREQ:

RREP Message Contents:
- Source Address: 0x0000 (coordinator)
- Destination Address: 0x0023 (original requester)
- Hop Count: 2 (hops from destination)
- Route Lifetime: 60 seconds (how long to cache route)

RREP Propagation:

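  1. Destination sends the RREP back along the path on which the RREQ arrived
  2. Each Router on that path stores the forward route (toward the destination); the reverse route (toward the source) was recorded when the RREQ passed through
  3. RREP travels unicast (not broadcast), which keeps it efficient
  4. Source receives RREP and caches the route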
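
As a rough sketch of the RREP's return trip, the function below walks hop by hop from the destination back to the source and records, at each node, the next hop and hop count toward the destination. The name `send_rrep` and the dictionary layout are illustrative assumptions; the addresses are the ones used in this chapter's example.

```python
def send_rrep(reverse_next_hop, source, destination):
    """Walk the RREP from destination to source, installing routes on the way.

    reverse_next_hop: node -> neighbor toward the source, as recorded while
    the RREQ was flooding.
    Returns {node: (next_hop_toward_destination, hop_count_to_destination)}.
    """
    routes = {}
    node, hops = destination, 0
    while node != source:
        previous = reverse_next_hop[node]  # unicast one hop back toward the source
        hops += 1
        routes[previous] = (node, hops)
        node = previous
    return routes

# Reverse pointers for the path sensor 0x0023 -> router 0x0001 -> coordinator 0x0000.
reverse_pointers = {0x0000: 0x0001, 0x0001: 0x0023}
rrep_routes = send_rrep(reverse_pointers, source=0x0023, destination=0x0000)
print(rrep_routes[0x0001])  # (0, 1): the router reaches the coordinator directly
print(rrep_routes[0x0023])  # (1, 2): the sensor goes via router 0x0001, 2 hops
```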

20.3.5 Route Table Entries

After route discovery, each device stores route information:

Router 1 Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0000   | 1         | 60s      |
| 0x0023      | 0x0023   | 1         | 60s      |

Sensor Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0001   | 2         | 60s      |

20.4 Route Maintenance

Routes don’t last forever. AODV includes mechanisms for maintaining valid routes:

20.4.1 Route Expiration

Route Lifecycle:
1. Route discovered → Lifetime timer starts (60s typical)
2. Data transmitted → Timer reset to full value
3. No activity → Timer counts down
4. Timer expires → Route marked invalid
5. Next transmission → New route discovery required
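
The lifecycle above can be modelled with a small table that stores an expiry time per destination. This is a sketch under stated assumptions: the class name `RouteTable`, the explicit `now` argument (used here instead of a real clock so the behaviour is easy to follow), and the 60-second default are illustrative.

```python
class RouteTable:
    """Route cache whose entries expire unless traffic keeps refreshing them."""

    def __init__(self, lifetime_s=60.0):
        self.lifetime_s = lifetime_s
        self._routes = {}  # destination -> [next_hop, expires_at]

    def add(self, destination, next_hop, now):
        # Step 1: route discovered, lifetime timer starts.
        self._routes[destination] = [next_hop, now + self.lifetime_s]

    def next_hop(self, destination, now):
        entry = self._routes.get(destination)
        if entry is None or now >= entry[1]:
            # Steps 4-5: timer expired, so the caller must rediscover the route.
            self._routes.pop(destination, None)
            return None
        # Step 2: data transmitted, timer reset to the full value.
        entry[1] = now + self.lifetime_s
        return entry[0]

table = RouteTable(lifetime_s=60.0)
table.add(0x0000, next_hop=0x0001, now=0.0)
print(table.next_hop(0x0000, now=30.0))   # 1 (still valid, timer reset to t=90)
print(table.next_hop(0x0000, now=80.0))   # 1 (valid because of the reset)
print(table.next_hop(0x0000, now=141.0))  # None (idle for more than 60 s)
```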

20.4.2 Route Error (RERR)

When a link fails (device offline, out of range), the detecting device sends RERR:

RERR Trigger Conditions:
- MAC-layer ACK not received after retries
- Neighbor timeout (no heartbeats)
- Explicit device leave notification

RERR Message:
- Unreachable Destination: 0x0001 (failed router)
- Affected Routes: List of destinations via failed router

Figure 20.2: Route error detection and automatic recovery through alternate path
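
A device that detects the failure has to work out which of its routes depended on the failed neighbor. The sketch below shows that step only; the table layout (destination mapped to next hop) and the name `invalidate_routes` are assumptions for illustration, and the returned list stands in for the "Affected Routes" field of the RERR.

```python
def invalidate_routes(routing_table, failed_next_hop):
    """Drop every route that uses the failed neighbor as its next hop.

    routing_table: dict mapping destination -> next hop (modified in place).
    Returns the sorted list of destinations that became unreachable.
    """
    affected = sorted(
        dest for dest, hop in routing_table.items() if hop == failed_next_hop
    )
    for dest in affected:
        del routing_table[dest]
    return affected

# Hypothetical table: two destinations are reached via router 0x0001.
routes_before_failure = {0x0000: 0x0001, 0x0045: 0x0001, 0x0030: 0x0002}
unreachable = invalidate_routes(routes_before_failure, failed_next_hop=0x0001)
print([hex(d) for d in unreachable])  # ['0x0', '0x45']
```

Each destination in the returned list then needs a fresh route discovery before traffic to it can resume.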

20.5 Self-Healing Mesh

Self-healing is one of Zigbee’s most valuable features for reliability-critical deployments.

20.5.1 Self-Healing Timeline

When a Router fails, the network recovers automatically:

T = 0: Router fails (power loss, damage, interference)

T = 0-300ms: Failure detection
- Devices sending to failed router don't receive ACKs
- 3 retries × 100ms timeout = 300ms

T = 300ms-3s: Route invalidation
- Devices mark routes via failed router as invalid
- RERR propagates to affected sources

T = 3-10s: Route rediscovery
- Affected devices broadcast new RREQs
- Parallel discovery for all affected routes

T = 10s: Network recovered
- All devices have new routes
- Traffic flows through alternate paths

20.5.2 Redundancy Design

For reliable self-healing, design networks with path redundancy:

Minimum Redundancy (N+1):

Every end device should have at least 2 routers in range
If Router A fails → Route through Router B

Recommended Redundancy (N+2 or N+3):

3-4 routers in range of each end device
Multiple alternate paths available
Faster recovery, better load balancing
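
One way to check a floor plan against these redundancy targets is to count the routers within radio range of each end device. The sketch below assumes 2D coordinates in metres, a conservative 10 m indoor range, and hypothetical device names; it ignores walls and interference, so treat the result as a first-pass filter rather than a site survey.

```python
import math

def routers_in_range(position, routers, radio_range_m):
    """Return the names of routers within radio range of a position."""
    return [
        name for name, router_pos in routers.items()
        if math.dist(position, router_pos) <= radio_range_m
    ]

def under_protected(devices, routers, radio_range_m=10.0, minimum=2):
    """Return end devices that can hear fewer than `minimum` routers."""
    return sorted(
        name for name, pos in devices.items()
        if len(routers_in_range(pos, routers, radio_range_m)) < minimum
    )

# Hypothetical layout along one corridor (x, y in metres).
ROUTERS = {"R1": (0.0, 0.0), "R2": (10.0, 0.0), "R3": (20.0, 0.0)}
DEVICES = {"S1": (5.0, 0.0), "S2": (28.0, 0.0)}

print(under_protected(DEVICES, ROUTERS))  # ['S2'] (only R3 is within 10 m)
```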

20.5.3 Self-Healing Verification

Test your network’s self-healing capability:

Test Procedure:
1. Monitor message delivery rate (baseline)
2. Disable one router (power off)
3. Measure recovery time (messages start flowing again)
4. Verify new routes in device tables
5. Re-enable router, verify mesh rebalances

Expected Results:
- Recovery time: 5-15 seconds
- Message loss during recovery: 1-5 messages
- Full restoration of delivery rate

20.6 Latency Considerations

Multi-hop routing adds latency. Understanding this helps set appropriate expectations:

20.6.1 Per-Hop Latency

| Component | Time |
|-----------|------|
| CSMA/CA backoff | 5-20ms |
| Transmission (127 bytes @ 250 kbps) | ~4ms |
| Processing at router | 1-5ms |
| Total per hop | 10-30ms |

20.6.2 End-to-End Latency Examples

| Path Length | Typical Latency | Worst Case |
|-------------|-----------------|------------|
| 1 hop | 10-30ms | 50ms |
| 2 hops | 20-60ms | 100ms |
| 3 hops | 30-90ms | 150ms |
| 5 hops | 50-150ms | 250ms |

20.6.3 Interactive: Zigbee Routing Latency Estimator

Explore how hop count and route discovery affect end-to-end Zigbee latency.

20.6.4 First Message Latency

The first message to a new destination incurs route discovery overhead:

Route Discovery Time:
- RREQ flood: 100-500ms (depends on network size)
- RREP return: 50-200ms
- Total: 150-700ms (first message)

Subsequent Messages:
- Use cached route: 10-30ms per hop

Design Implication: For latency-sensitive applications, keep hop counts low (3-4 hops maximum) and consider pre-establishing routes during network formation.
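
The figures in this section combine into a simple range estimate. The function below is a back-of-the-envelope sketch that uses the chapter's 10-30 ms per hop and 150-700 ms discovery ranges as defaults; the function name and parameters are illustrative, and real latency depends on channel load and retries.

```python
def estimate_latency_ms(hops, first_message=False,
                        per_hop_ms=(10, 30), discovery_ms=(150, 700)):
    """Return (low, high) end-to-end latency in milliseconds.

    hops: number of hops on the path.
    first_message: add route discovery overhead when no route is cached.
    """
    low = hops * per_hop_ms[0]
    high = hops * per_hop_ms[1]
    if first_message:
        low += discovery_ms[0]
        high += discovery_ms[1]
    return low, high

print(estimate_latency_ms(3))                      # (30, 90)
print(estimate_latency_ms(3, first_message=True))  # (180, 790)
```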

20.6.5 Why AODV and Not RPL? Zigbee’s Routing Protocol Choice

Zigbee adopted AODV (a reactive, on-demand protocol) rather than RPL (a proactive, tree-based protocol widely used in 6LoWPAN networks) for reasons rooted in the constraints of early 2000s hardware and Zigbee’s target use cases.

Memory constraints drove the decision. RPL requires each node to maintain a Directed Acyclic Graph (DAG) with parent sets, rank information, and Trickle timers – approximately 200-500 bytes of RAM per routing entry. In 2004 when Zigbee 1.0 was standardized, typical target platforms (Ember EM250, Freescale MC1322x) had 4-8 KB of available RAM after the protocol stack. AODV stores routing entries only for active destinations – a device communicating with 3 destinations uses approximately 36 bytes (12 bytes per entry), compared to RPL’s 200+ bytes for the same topology awareness.

Traffic pattern assumptions also mattered. RPL optimizes for many-to-one collection traffic (sensors reporting to a border router), which is the dominant pattern in industrial monitoring. Zigbee was designed for home automation where traffic patterns are more diverse: a light switch sends commands to specific lights (point-to-point), a remote control sends to a media center, and a door sensor reports to a hub. AODV handles arbitrary point-to-point traffic efficiently because it discovers only the routes actually needed.

The trade-off is visible in practice. AODV’s first-message latency (150-700 ms for route discovery) is noticeable when you press a Zigbee light switch for the first time after the route has expired. Networks that maintain routes proactively, as RPL does, avoid this discovery delay because the route is already established. However, AODV’s lower memory footprint allows Zigbee to run on cheaper, more constrained chips, a meaningful cost advantage at scale when deploying hundreds of sensors.

20.7 Alternative Routing Methods

While AODV is the primary routing protocol, Zigbee supports additional methods:

20.7.1 Tree Routing

Hierarchical routing based on network address structure:

Address-Based Routing:
- Coordinator: 0x0000
- Router A (child of Coordinator): 0x1000
- Router B (child of Router A): 0x1100
- End Device (child of Router B): 0x1110

Route to 0x1110:
0x0000 → 0x1000 → 0x1100 → 0x1110
(Follow address hierarchy)

Advantage: No route discovery needed - address implies route

Disadvantage: Rigid structure, no alternate paths
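
To make "address implies route" concrete, the sketch below derives the path purely from the destination address, following the simplified one-hex-digit-per-level scheme used in the example above. That scheme is an illustration only: actual Zigbee tree addressing allocates address blocks with a computed skip value rather than fixed hex digits, and the function name is hypothetical.

```python
def tree_path(destination):
    """Return the path from the Coordinator (0x0000) to `destination`.

    Assumes the illustrative scheme in which each tree level fills in one
    more hex digit, so a parent address is the child address with its
    lowest non-zero digit cleared.
    """
    path = [destination]
    address = destination
    for shift in (0, 4, 8, 12):   # walk the hex digits from lowest to highest
        mask = 0xF << shift
        if address & mask:
            address &= ~mask      # clear this digit to get the parent
            path.append(address)
    return list(reversed(path))

print([f"0x{a:04X}" for a in tree_path(0x1110)])
# ['0x0000', '0x1000', '0x1100', '0x1110']
```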

20.7.2 Source Routing

Sender specifies the complete path in the packet:

Source Route Header:
- Path: [0x0001, 0x0002, 0x0003, 0x0000]
- Each router forwards to next in list

Advantage: Predictable path, no route lookups at routers

Disadvantage: Larger packet headers, sender must know full path
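
A router handling a source-routed packet only needs to find itself in the list and hand the packet to the following entry. The sketch below shows that lookup with the example path from this section; the function name is hypothetical, and a real frame carries a relay index rather than searching the list.

```python
def next_hop_from_source_route(path, current_node):
    """Return the node after `current_node` in the path, or None at the end."""
    index = path.index(current_node)  # raises ValueError if we are not on the path
    if index + 1 == len(path):
        return None                   # current node is the final destination
    return path[index + 1]

SOURCE_ROUTE = [0x0001, 0x0002, 0x0003, 0x0000]
print(next_hop_from_source_route(SOURCE_ROUTE, 0x0002))  # 3
print(next_hop_from_source_route(SOURCE_ROUTE, 0x0000))  # None
```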

20.7.3 Many-to-One Routing

Optimized for sensor networks where most traffic flows toward a central collector (the concentrator, typically the Coordinator):

1. Coordinator broadcasts a Many-to-One Route Request
   (a special RREQ with the many-to-one flag set)
2. Every router that receives it creates a routing table
   entry pointing toward the Coordinator (reverse route)
3. When a device sends data to the Coordinator, each
   router along the path appends its address to a
   Route Record frame
4. The Coordinator uses collected Route Records to build
   source routes for downlink (Coordinator → device) traffic

Advantage: Eliminates per-device RREQ floods for the dominant uplink traffic pattern; scales to large networks

Disadvantage: Only optimizes traffic toward the concentrator; downlink still requires source routing or on-demand AODV
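
The four steps can be sketched in two functions: one floods the Many-to-One Route Request so that every node learns a next hop toward the concentrator, and one collects the relays an uplink frame passes through. The topology and function names are hypothetical, and the flood is modelled as a breadth-first search with equal link costs.

```python
from collections import deque

def many_to_one_flood(neighbors, concentrator):
    """Return {node: next hop toward the concentrator} after one flood."""
    toward = {}
    seen = {concentrator}
    queue = deque([concentrator])
    while queue:
        node = queue.popleft()
        for nbr in neighbors[node]:
            if nbr not in seen:
                seen.add(nbr)
                toward[nbr] = node  # first copy heard wins
                queue.append(nbr)
    return toward

def route_record(toward, device, concentrator):
    """Return the relays an uplink frame from `device` traverses, in order."""
    relays = []
    node = toward[device]
    while node != concentrator:
        relays.append(node)         # each relay appends its own address
        node = toward[node]
    return relays

# Hypothetical chain: coordinator 0x0000 - 0x0001 - 0x0002 - sensor 0x0023.
CHAIN = {0x0000: [0x0001], 0x0001: [0x0000, 0x0002],
         0x0002: [0x0001, 0x0023], 0x0023: [0x0002]}

toward_concentrator = many_to_one_flood(CHAIN, concentrator=0x0000)
uplink_relays = route_record(toward_concentrator, 0x0023, 0x0000)
downlink_source_route = list(reversed(uplink_relays))
print(uplink_relays)          # [2, 1]
print(downlink_source_route)  # [1, 2]
```

Reversing the recorded relays gives the Coordinator a source route for downlink traffic to that device without a separate discovery.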

20.8 Worked Example: Zigbee Routing in a Three-Floor Office Building

Scenario: An office building deploys 120 Zigbee devices across three floors, with a single Coordinator on Floor 2. Each floor has 8 routers (mains-powered smart plugs) and 32 battery-powered occupancy sensors (end devices), for 24 routers and 96 sensors in total.

Network Dimensions:

  • Floor area: 40m x 25m per floor
  • Ceiling height: 3m (concrete slab between floors)
  • Router spacing: ~10m apart (within Zigbee’s 10-30m indoor range)
  • Maximum distance from any sensor to nearest router: 8m

Step 1: Calculate Maximum Hop Count

Floor 2 sensors → local Router → Coordinator (1-2 hops)
Floor 1/3 sensors → local Router → floor relay Router → Coordinator (2-4 hops)
Worst case: Corner of Floor 1 or 3

Path: Sensor → Router (floor corner) → Router (floor center) →
      Router (near stairwell) → Coordinator
Hops: 4 (within the 5-7 recommended maximum)

Step 2: Calculate End-to-End Latency

Per-hop latency: 10-30ms (CSMA/CA + transmission + processing)
4-hop worst case:
  Best: 4 x 10ms = 40ms
  Typical: 4 x 20ms = 80ms
  Worst: 4 x 30ms = 120ms

First-message overhead (route discovery):
  RREQ flood: ~200ms (24 routers across 3 floors)
  RREP return: ~80ms (4-hop unicast at 20ms per hop)
  Total first message: 280ms discovery + 80ms data = 360ms
  Subsequent messages: 80ms (cached route)

How does multi-hop latency add up in a 3-floor office building? Let’s break down the math for a worst-case 4-hop path.

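Per-hop latency components: $ t_{\text{hop}} = t_{\text{CSMA}} + t_{\text{TX}} + t_{\text{proc}} = 15 + 4 + 1 = 20\ \text{ms} $

4-hop end-to-end latency: $ t_{\text{4-hop}} = 4 \times 20\ \text{ms} = 80\ \text{ms} $

First message includes route discovery: $ t_{\text{RREQ}} = 200\ \text{ms} $, $ t_{\text{RREP}} = 4 \times 20\ \text{ms} = 80\ \text{ms} $, so $ t_{\text{first}} = 200 + 80 + 80 = 360\ \text{ms} $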

Key insight: First message takes 4.5× longer than subsequent messages. For real-time control, pre-establish routes during network formation.

Step 3: Evaluate Self-Healing Capacity

Routers per floor: 8
Average neighbors per router: 3-4 (overlapping coverage)
Path redundancy: N+1 to N+2 (each sensor can reach 2-3 routers)

If 1 router fails on Floor 1:
  Affected end devices: ~4 sensors (those closest to failed router)
  Recovery time: 5-15 seconds (RERR + new RREQ/RREP)
  Alternate paths available: 2-3 via neighboring routers
  Message loss during recovery: 1-3 messages (at 30s reporting interval,
    likely zero lost since recovery < reporting interval)

Step 4: Identify the Routing Method

| Traffic Pattern | Routing Method | Why |
|-----------------|----------------|-----|
| Sensor → Coordinator | Many-to-One | 95% of traffic flows to Coordinator; pre-established routes reduce discovery overhead |
| Coordinator → single sensor | AODV on-demand | Infrequent commands; route cached from uplink traffic |
| Firmware update to all | Tree Routing + broadcast | Hierarchical delivery via address-based forwarding |

Decision: Use Many-to-One routing as primary method. The Coordinator periodically broadcasts Many-to-One Route Requests, and all 24 routers establish upstream paths. This eliminates RREQ floods for the dominant traffic pattern (sensor data collection), reducing network overhead by approximately 80% compared to pure AODV.

Key Insight: For data-collection IoT networks where traffic is predominantly many-to-one, configuring the Coordinator as the route concentrator dramatically reduces route discovery traffic. Reserve on-demand AODV for the rare downlink commands.

20.9 Routing Best Practices

20.9.1 Network Design

  1. Limit hop count: Design for maximum 5-7 hops
  2. Ensure redundancy: 2-3 routers in range of each device
  3. Place routers strategically: Hallways, central locations
  4. Avoid bottlenecks: Multiple paths between network regions

20.9.2 Monitoring

Track these metrics to identify routing issues:

| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| Average hop count | 2-3 | 4-5 | 6+ |
| Route discovery rate | < 1/min | 1-5/min | > 5/min |
| Route failures | < 1/hour | 1-5/hour | > 5/hour |
| Message delivery | > 99% | 95-99% | < 95% |
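
These thresholds translate directly into a small classifier that a monitoring script could apply to collected counters. The sketch below covers the rate and delivery rows only; the function names are hypothetical, and how the counters are gathered depends on your gateway or stack.

```python
def classify_rate(value, warning_at=1.0, critical_above=5.0):
    """Classify a rate where higher is worse (discoveries/min, failures/hour)."""
    if value > critical_above:
        return "critical"
    if value >= warning_at:
        return "warning"
    return "healthy"

def classify_delivery(percent):
    """Classify message delivery, where lower is worse."""
    if percent < 95.0:
        return "critical"
    if percent <= 99.0:
        return "warning"
    return "healthy"

print(classify_rate(0.5))       # healthy
print(classify_rate(3.0))       # warning
print(classify_rate(7.0))       # critical
print(classify_delivery(97.0))  # warning
```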

20.9.3 Troubleshooting

Common routing problems and solutions:

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| High latency | Too many hops | Add routers to reduce hop count |
| Frequent route changes | Marginal links | Improve router placement |
| Devices dropping offline | Insufficient redundancy | Add backup routers |
| Recovery too slow | Network too large | Segment into multiple PANs |

Sammy the Sensor needs to send a message: “How does my temperature reading find its way to the Coordinator across a big building?”

Max the Microcontroller explains: “We use AODV routing! When you need to send a message to someone you’ve never talked to, you broadcast a Route Request (RREQ) – like shouting ‘Does anyone know how to reach the Coordinator?’ Every Router passes the shout along until it reaches the destination.”

Lila the LED continues: “Then the Coordinator replies with a Route Reply (RREP) that comes back along the path, like leaving breadcrumbs. Now you know exactly which Routers to use!”

Bella the Battery adds the best part: “And if a Router breaks or gets unplugged? The mesh self-heals! The network notices the broken link and finds a new path around it – like water flowing around a rock in a stream.”

Key ideas for kids:

  • AODV = A way to discover the best path by asking neighbors
  • Route Request (RREQ) = Shouting “How do I get there?” through the network
  • Route Reply (RREP) = The answer coming back with directions
  • Self-healing = Automatically finding a new path when one breaks

20.10 Knowledge Check

Q1: What triggers AODV route discovery in a Zigbee network?

  A) The Coordinator periodically broadcasts routing updates to all devices
  B) A device needs to send data to a destination for which it has no routing entry
  C) Every device discovers routes to all other devices during network formation
  D) Route discovery runs on a fixed 30-second timer

B) A device needs to send data to a destination for which it has no routing entry – AODV is an on-demand (reactive) routing protocol. Routes are only discovered when needed, conserving bandwidth and memory. This differs from proactive protocols that maintain routes to all destinations continuously.

20.11 Knowledge Check

Q2: How does Zigbee’s mesh network self-heal when a Router fails?

  A) The Coordinator immediately assigns a replacement Router
  B) Devices detect the failed link and trigger new AODV route discoveries to find alternate paths
  C) All devices restart and rejoin the network from scratch
  D) End Devices switch to direct communication with the Coordinator

B) Devices detect the failed link and trigger new AODV route discoveries to find alternate paths – When a device fails to deliver a message (no acknowledgment), it marks the route as broken and initiates a new route discovery. The mesh topology ensures multiple paths exist, so traffic reroutes around the failure automatically.

Common Pitfalls

When many devices simultaneously lose their routes (e.g., after network partition resolution), a flood of RREQ messages can saturate the 802.15.4 channel and prevent normal traffic for several seconds. Implement exponential backoff for route discovery retries.
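
One common form of this is exponential backoff with random jitter, so that devices which lost their routes at the same moment do not all retry together. The sketch below is application-level and illustrative: the 100 ms base, 5 s cap, and function names are assumptions, not values from the Zigbee specification.

```python
import random

def backoff_window_ms(attempt, base_ms=100, cap_ms=5000):
    """Upper bound of the wait before retry number `attempt` (0-based)."""
    return min(cap_ms, base_ms * 2 ** attempt)

def discovery_backoff_ms(attempt, base_ms=100, cap_ms=5000, rng=random):
    """Pick a random delay in [0, window] so retries spread out in time."""
    return rng.uniform(0, backoff_window_ms(attempt, base_ms, cap_ms))

print([backoff_window_ms(n) for n in range(7)])
# [100, 200, 400, 800, 1600, 3200, 5000]
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.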

The first packet to a new destination triggers route discovery, adding roughly 150–700 ms before delivery. Applications that need immediate first-message delivery should pre-discover routes at startup.

Zigbee routers have fixed routing table sizes (typically 16–32 entries). Dense networks with many unique source-destination pairs overflow routing tables, causing recent entries to evict older ones. Monitor routing table utilization in large deployments.
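
The eviction behaviour and a utilization metric can be modelled with a bounded table. This is a sketch that assumes oldest-entry-first eviction, which is one plausible policy rather than what any particular stack does; the class and attribute names are hypothetical.

```python
from collections import OrderedDict

class BoundedRouteTable:
    """Fixed-capacity route table that evicts the oldest entry when full."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._routes = OrderedDict()  # destination -> next hop, oldest first
        self.evictions = 0            # counter worth exporting to monitoring

    def add(self, destination, next_hop):
        if destination in self._routes:
            self._routes.move_to_end(destination)  # refresh an existing entry
        elif len(self._routes) >= self.capacity:
            self._routes.popitem(last=False)       # evict the oldest entry
            self.evictions += 1
        self._routes[destination] = next_hop

    def __contains__(self, destination):
        return destination in self._routes

    def utilization(self):
        return len(self._routes) / self.capacity

small_table = BoundedRouteTable(capacity=2)
small_table.add(0x0010, 0x0001)
small_table.add(0x0020, 0x0001)
small_table.add(0x0030, 0x0002)   # table is full, so 0x0010 is evicted
print(0x0010 in small_table, small_table.evictions, small_table.utilization())
# False 1 1.0
```

A steadily rising eviction count alongside utilization near 1.0 suggests the table is too small for the traffic pattern.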

20.12 Summary

This chapter covered Zigbee routing and self-healing:

  • AODV Protocol: On-demand route discovery saves memory and bandwidth
  • Route Discovery: RREQ broadcasts find paths, RREP unicasts establish routes
  • Route Maintenance: Lifetime timers and RERR messages keep routes valid
  • Self-Healing: Automatic recovery in 5-15 seconds when paths fail
  • Latency: 10-30ms per hop, plus discovery overhead for first messages

Key design principles:

  • Plan for redundancy (2-3 routers per end device)
  • Limit maximum hop count to 5-7
  • Test self-healing before deployment
  • Monitor routing metrics in production

Key Concepts

  • AODV (Ad Hoc On-Demand Distance Vector): The routing algorithm used in Zigbee mesh networks, discovering routes only when needed using RREQ/RREP message flooding.
  • Route Discovery: The process initiated by a Zigbee source node broadcasting RREQ messages to find a path to a destination; RREP messages trace back the route.
  • Route Table Entry: A stored path from a source to a destination in a Zigbee router’s routing table, used to forward future packets without re-running discovery.
  • RREQ (Route Request): A broadcast message initiating Zigbee route discovery; each intermediate router forwards it while recording the reverse path for the subsequent RREP.
  • RREP (Route Reply): A unicast message tracing back from destination to source through the discovered path, establishing routing table entries at each hop.
  • Link Cost: A metric (based on link quality indicator and expected transmissions) used to evaluate path quality during AODV route selection.

20.14 Concept Relationships

| Concept | Related To | How They Connect |
|---------|------------|------------------|
| AODV Protocol | On-Demand Routing | Routes discovered only when needed, saving memory and bandwidth |
| RREQ/RREP | Route Discovery | Request floods network, Reply establishes path back to source |
| Route Lifetime | Memory Efficiency | Routes expire after timeout, freeing table space for active paths |
| RERR Messages | Self-Healing | Route Error triggers immediate invalidation and rediscovery |
| Hop Count | Latency Budget | 10-30ms per hop determines total end-to-end delay |
| Many-to-One Routing | Sensor Networks | Optimized for data collection scenarios where all traffic flows to hub |

20.15 What’s Next

| Chapter | Focus |
|---------|-------|
| Zigbee Application Profiles | ZHA, ZLL, and Zigbee 3.0 interoperability profiles |
| Zigbee Security | Network-key distribution, Trust Center, and encrypted routing |
| Zigbee Network Topologies | Star, tree, and mesh configurations that underpin routing |
| Zigbee Network Formation | PAN formation, device joining, and address assignment |
| Zigbee Industrial Deployment | Scaling routing design to real-world industrial environments |