%% fig-cap: "Northbound and Southbound API Communication Patterns showing REST-based application integration and OpenFlow-based device control"
%% fig-alt: "Diagram showing bidirectional API communication. Top section shows Northbound APIs with three IoT applications (Security, Monitoring, Load Balancer) sending REST/gRPC requests to Controller and receiving JSON responses. Bottom section shows Southbound APIs with Controller sending OpenFlow messages (Flow-Mod, Packet-Out) to switches and receiving OpenFlow messages (Packet-In, Stats-Reply) from switches. Controller in center mediates between both interfaces."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TB
subgraph Applications["Applications (Northbound)"]
SecApp["Security App"]
MonApp["Monitoring App"]
LBApp["Load Balancer"]
end
subgraph NorthAPI["Northbound API (REST/gRPC)"]
REST1["POST /firewall/block"]
REST2["GET /topology"]
REST3["POST /path/compute"]
end
Controller["SDN Controller<br/>(Policy Translation)"]
subgraph SouthAPI["Southbound API (OpenFlow)"]
OF1["Flow-Mod"]
OF2["Packet-Out"]
OF3["Packet-In"]
OF4["Stats-Request"]
end
subgraph Switches["Network Devices (Southbound)"]
SW1["Switch 1"]
SW2["Switch 2"]
SW3["Switch 3"]
end
SecApp -->|"REST: Block IP"| REST1
MonApp -->|"REST: Get topology"| REST2
LBApp -->|"gRPC: Compute path"| REST3
REST1 --> Controller
REST2 --> Controller
REST3 --> Controller
Controller -->|"Translate policy<br/>to flow rules"| OF1
Controller --> OF2
Controller --> OF4
OF1 --> SW1
OF1 --> SW2
OF2 --> SW3
OF4 --> SW1
SW1 -->|"Packet-In (no match)"| OF3
SW2 -->|"Stats-Reply"| Controller
SW3 -->|"Barrier-Reply"| Controller
OF3 --> Controller
style Applications fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
style Controller fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
style Switches fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
style NorthAPI fill:#7F8C8D,stroke:#2C3E50,stroke-width:1px,color:#333
style SouthAPI fill:#7F8C8D,stroke:#2C3E50,stroke-width:1px,color:#333
291 SDN APIs and High Availability
291.1 Learning Objectives
By the end of this chapter, you will be able to:
- Design Controller APIs: Describe northbound (REST/gRPC) and southbound (OpenFlow) API interactions and message types
- Implement High Availability: Apply controller clustering and failover strategies for production deployments
- Configure State Synchronization: Understand Raft consensus, eventual consistency, and distributed transactions for controller clusters
- Plan Switch-Controller Connections: Configure auxiliary connections and failover behavior for redundancy
291.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- SDN Controller Architecture: Understanding controller components and message flow is essential for API design
- SDN Controller Comparison: Knowledge of different controllers helps understand HA implementation differences
- SDN Fundamentals and OpenFlow: OpenFlow protocol basics provide context for southbound API messages
Think of the SDN controller like a restaurant kitchen.
Northbound API = how waiters (applications) communicate with the kitchen:
- "Table 5 wants the salmon, no nuts" -> "Block IP 10.0.1.50 from the network"
- High-level requests; no need to know cooking details
Southbound API = how the head chef (controller) directs line cooks (switches):
- "Grill salmon at 400F for 8 minutes" -> "Install flow rule: match src=10.0.1.50, action=DROP"
- Precise, technical instructions
Clustering = having multiple kitchens ready to take over:
- If the main kitchen catches fire, the backup kitchen continues serving
- Customers (network traffic) barely notice the switch
- More kitchens = higher reliability, but also more coordination overhead
For IoT specifically:
- 99.99% uptime requires clustering (only ~52 minutes downtime/year)
- Failover time of 3-5 seconds means brief traffic disruption during controller switchover
- Most IoT deployments use 3-node clusters (survives one failure with manageable overhead)
Deep Dives:
- SDN Controller Basics (Overview): index of all SDN controller topics
- SDN Controller Architecture: internal components and message flow
- SDN Controller Comparison: comparing OpenDaylight, ONOS, Ryu, Floodlight
Advanced Topics:
- SDN Analytics and Implementations: traffic engineering and network slicing
- SDN for IoT: Variants and Challenges: IoT-specific optimizations
Architecture:
- Edge-Fog Computing: distributed control planes
The Myth: Adding more controllers to a cluster always improves network performance and scalability.
The Reality: Controller clustering is about high availability, not raw performance. More controllers can actually decrease performance due to coordination overhead.
Real-World Example:
A smart city deployment tested ONOS controller scaling with 10,000 IoT devices:
- Single controller: 50,000 flow installations/second, 15ms latency
- 3-node cluster: 45,000 flow installations/second, 18ms latency (10% slower)
- 5-node cluster: 38,000 flow installations/second, 25ms latency (24% slower)
Why performance degrades:
- State synchronization overhead: Every flow rule must be replicated to all cluster members (3x the network traffic for 3-node cluster)
- Consensus protocols: Cluster must agree on state changes (Raft/Paxos adds 5-10ms latency)
- Leader election delays: When active controller fails, cluster needs 2-5 seconds to elect new leader
The right approach:
- Use clustering for reliability (99.99% -> 99.9999% uptime), not performance
- Deploy 3 controllers (optimal balance: survives 1 failure, minimal overhead)
- For performance scaling, use controller federation (divide network into domains, each with its own controller)
- Example: Google’s B4 WAN uses federated controllers (one per datacenter) managing 100,000+ devices, rather than a single massive cluster
Bottom Line: A well-tuned single controller often outperforms a poorly configured cluster. Use clustering when availability requirements exceed 99.9%, not as a default performance optimization.
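The uptime figures quoted above map directly to downtime budgets. A quick sketch of the arithmetic in plain Python (no external dependencies):

```python
# Downtime-per-year arithmetic behind the availability figures above.
MIN_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for nines in ("99.9", "99.99", "99.999", "99.9999"):
    availability = float(nines) / 100
    downtime = MIN_PER_YEAR * (1 - availability)
    print(f"{nines}% uptime -> {downtime:.1f} min downtime/year")
# 99.99% -> ~52.6 min/year: the threshold where clustering starts to pay off
```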
291.3 Northbound and Southbound APIs
The controller acts as a mediator between applications (northbound) and network devices (southbound).
291.4 Northbound APIs
Northbound APIs allow applications to interact with the controller using high-level abstractions.
291.4.1 Common API Types
1. REST API (most common)
- HTTP-based (GET/POST/PUT/DELETE)
- JSON payloads
- Stateless communication
2. gRPC (modern alternative)
- Protocol Buffers for serialization
- Bidirectional streaming
- Lower latency than REST
3. NETCONF (configuration management)
- XML-based
- Transactional operations
- Used for device provisioning
291.4.2 Example REST API Calls
# Get network topology
curl -X GET http://controller:8181/restconf/operational/network-topology:network-topology
# Block traffic from IoT device
curl -X POST http://controller:8181/restconf/operations/firewall:block \
-H "Content-Type: application/json" \
-d '{"source-ip": "10.0.1.50", "action": "drop"}'
# Query flow statistics
curl -X GET http://controller:8181/restconf/operational/opendaylight-inventory:nodes/node/openflow:1/flow-statistics
291.4.3 Northbound API Comparison
| Protocol | Latency | Throughput | Complexity | Best For |
|---|---|---|---|---|
| REST | 5-20ms | 10K req/sec | Low | General apps |
| gRPC | 1-5ms | 100K req/sec | Medium | High-perf apps |
| NETCONF | 10-50ms | 1K req/sec | High | Configuration |
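The same northbound calls can be made from application code. Below is a minimal sketch using Python's requests library, mirroring the curl examples above; the admin/admin credentials are assumed OpenDaylight defaults, and the firewall:block RPC is the same illustrative endpoint used earlier, not a stock ODL API.

```python
# A minimal sketch of a northbound REST client using the requests library.
# Paths mirror the curl examples above; credentials are assumed defaults.
import requests

BASE = "http://controller:8181/restconf"
AUTH = ("admin", "admin")  # assumption: stock OpenDaylight credentials

# Retrieve the operational network topology
topo = requests.get(
    f"{BASE}/operational/network-topology:network-topology",
    auth=AUTH, timeout=5)
topo.raise_for_status()
print(topo.json())

# Ask the controller to drop traffic from an IoT device
# (firewall:block is an illustrative RPC, as in the curl example)
resp = requests.post(
    f"{BASE}/operations/firewall:block",
    json={"source-ip": "10.0.1.50", "action": "drop"},
    auth=AUTH, timeout=5)
print(resp.status_code)
```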
291.5 Southbound APIs
Southbound APIs control network devices using standardized protocols.
291.5.1 Primary Protocols
1. OpenFlow (dominant standard)
- Flow-based forwarding
- Version 1.3+ most common in IoT
- Supports meters, groups, multi-table pipelines
2. NETCONF (device configuration)
- Configure device parameters
- Firmware updates
- State retrieval (a NETCONF client sketch follows this list)
3. OVSDB (Open vSwitch Database)
- Manage virtual switches
- Port configuration
- Tunnel setup
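As a concrete illustration of the NETCONF item above, here is a minimal sketch using the ncclient Python library to pull a device's running configuration. The host address and credentials are placeholders; the device must have NETCONF over SSH enabled (port 830 is the standard default).

```python
# A minimal NETCONF sketch using ncclient: fetch a device's running
# configuration. Host, username, and password are placeholder values.
from ncclient import manager

with manager.connect(
        host="10.0.0.5",        # assumption: managed device address
        port=830,               # standard NETCONF-over-SSH port
        username="admin",
        password="admin",
        hostkey_verify=False) as m:
    # <get-config> is transactional and returns XML, as noted above
    config = m.get_config(source="running")
    print(config)
```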
291.5.2 OpenFlow Message Types
| Message Type | Direction | Purpose | Example Use |
|---|---|---|---|
| Packet-In | Switch -> Controller | No matching flow rule | New IoT device sends first packet |
| Flow-Mod | Controller -> Switch | Install/modify flow rule | “Forward sensor data to analytics server” |
| Packet-Out | Controller -> Switch | Send specific packet | Controller generates ARP reply |
| Stats-Request | Controller -> Switch | Query statistics | “How many bytes on port 5?” |
| Stats-Reply | Switch -> Controller | Statistics response | “Port 5: 1.5 GB, 500K packets” |
| Barrier-Request | Controller -> Switch | Synchronization checkpoint | “Confirm all previous rules installed” |
| Barrier-Reply | Switch -> Controller | Confirmation | “All rules committed” |
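The table above describes a control loop: Packet-In travels up, Flow-Mod and Packet-Out travel down. A minimal sketch of that loop as a Ryu application (Ryu is one of the controllers compared in this series); the match fields and flood action are deliberately simplistic placeholders.

```python
# A minimal sketch (Ryu, OpenFlow 1.3) of the Packet-In -> Flow-Mod ->
# Packet-Out control loop described in the table above.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class IoTForwarder(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg = ev.msg                      # Packet-In: no flow rule matched
        dp = msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        in_port = msg.match['in_port']

        # Flow-Mod: install a rule so future packets skip the controller
        match = parser.OFPMatch(in_port=in_port)
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                      match=match, instructions=inst,
                                      idle_timeout=300))

        # Packet-Out: forward the packet that triggered the Packet-In
        data = msg.data if msg.buffer_id == ofp.OFP_NO_BUFFER else None
        dp.send_msg(parser.OFPPacketOut(datapath=dp, buffer_id=msg.buffer_id,
                                        in_port=in_port, actions=actions,
                                        data=data))
```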
291.5.3 OpenFlow Flow-Mod Example
{
"flow": {
"id": "sensor-to-gateway-001",
"table_id": 0,
"priority": 100,
"match": {
"in-port": 1,
"eth-type": "0x0800",
"ipv4-source": "10.0.1.0/24"
},
"instructions": {
"apply-actions": {
"action": [
{"output-action": {"output-node-connector": 5}}
]
}
},
"hard-timeout": 0,
"idle-timeout": 300
}
}
291.6 High Availability and Clustering
The Mistake: Expecting that when the active SDN controller fails and a backup takes over, all existing flow rules on switches remain intact and operational. Teams design failover assuming zero traffic disruption.
Why It Happens: Unlike traditional switches that maintain forwarding tables independently, OpenFlow switches may clear flow tables or mark flows as invalid when controller connection is lost, depending on flow timeout settings and switch implementation. The assumption that “data plane continues while control plane recovers” is only partially true.
The Fix: Configure appropriate flow timeouts: use hard timeouts for security-sensitive flows (force re-authentication), but set idle timeouts for stable traffic patterns (keeps rules while traffic flows). Install critical flows as “permanent” (no timeout) via the controller. Test failover scenarios with production-like traffic to measure actual packet loss. Most importantly, configure switches with multiple controller connections (primary + backup) so they can immediately request new master role from backup controller, reducing failover time from 30+ seconds to under 5 seconds.
For production IoT deployments, controller failure is unacceptable. Clustering provides redundancy.
%% fig-cap: "SDN Controller Clustering Architecture showing 3-node cluster with state synchronization and switch connections"
%% fig-alt: "Diagram of controller high availability setup with three controller nodes in a cluster. Controllers labeled Controller-1 (Master), Controller-2 (Slave), and Controller-3 (Slave) are interconnected with bidirectional State Sync arrows. Below, three OpenFlow switches (Switch A, B, C) connect to all three controllers via dashed lines (backup connections) and solid lines (active connections). Failure scenario shown: Controller-1 fails, Controllers 2 and 3 elect new master in 3-5 seconds. Switches reconnect automatically."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TB
subgraph Cluster["Controller Cluster (High Availability)"]
C1["Controller-1<br/>(Master)"]
C2["Controller-2<br/>(Slave)"]
C3["Controller-3<br/>(Slave)"]
end
subgraph StateSync["State Synchronization"]
Raft["Raft Consensus<br/>(Leader Election)"]
StateDB["Distributed State DB<br/>(Topology, Flows)"]
end
subgraph Network["Network Infrastructure"]
SW1["Switch A"]
SW2["Switch B"]
SW3["Switch C"]
end
C1 <-->|"State Sync"| C2
C2 <-->|"State Sync"| C3
C3 <-->|"State Sync"| C1
C1 --> Raft
C2 --> Raft
C3 --> Raft
Raft --> StateDB
C1 -.->|"Backup"| SW1
C1 ==>|"Active"| SW2
C1 -.->|"Backup"| SW3
C2 ==>|"Active"| SW1
C2 -.->|"Backup"| SW2
C2 -.->|"Backup"| SW3
C3 -.->|"Backup"| SW1
C3 -.->|"Backup"| SW2
C3 ==>|"Active"| SW3
Failure["Failure Scenario:<br/>Controller-1 fails"]
Recovery["Recovery:<br/>- Controllers 2&3 detect failure<br/>- New master elected (3-5s)<br/>- Switches reconnect automatically"]
Failure -.-> C1
Failure --> Recovery
style C1 fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
style C2 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
style C3 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
style Cluster fill:#ECF0F1,stroke:#2C3E50,stroke-width:2px
style StateSync fill:#ECF0F1,stroke:#16A085,stroke-width:2px
style Failure fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
style Recovery fill:#27AE60,stroke:#2C3E50,stroke-width:2px,color:#fff
291.7 Clustering Strategies
291.7.1 1. Active-Standby (Simplest)
- One controller active, others on standby
- Standby takes over if active fails
- Failover time: 5-30 seconds (heartbeat detection + state recovery)
- Advantage: Simple, no state conflicts
- Disadvantage: Standby resources wasted
291.7.2 2. Active-Active (ONOS approach)
- All controllers active, network partitioned across controllers
- Each controller manages subset of switches
- Failover time: 3-5 seconds (just reassign switches)
- Advantage: Better resource utilization
- Disadvantage: Complex state synchronization
291.7.3 3. Distributed Hash Table (ODL approach)
- Network state distributed across cluster using consistent hashing
- Each controller owns portion of state space
- Failover time: 2-5 seconds
- Advantage: Scales to large clusters
- Disadvantage: More complex programming model
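A toy sketch of the consistent-hashing idea behind the DHT approach: each controller owns the slice of the hash ring its virtual nodes cover, and when a controller fails only its slice is reassigned. This illustrates the concept only, not ONOS/ODL internals.

```python
# Toy consistent-hash ring: maps switch IDs to owning controllers.
import hashlib
from bisect import bisect_right

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, controllers, vnodes=64):
        # Each controller gets vnodes points on the ring for even spread.
        self.ring = sorted((_h(f"{c}#{i}"), c)
                           for c in controllers for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def owner(self, switch_id: str) -> str:
        # First ring point clockwise from the switch's hash owns it.
        idx = bisect_right(self.keys, _h(switch_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["ctrl-1", "ctrl-2", "ctrl-3"])
print(ring.owner("openflow:1"))  # stable assignment until membership changes
```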
291.7.4 Clustering Comparison
| Strategy | Failover Time | Resource Efficiency | Complexity | Best For |
|---|---|---|---|---|
| Active-Standby | 5-30s | 50% (standby idle) | Low | Simple HA |
| Active-Active | 3-5s | 90%+ | Medium | Production |
| DHT-based | 2-5s | 85%+ | High | Large scale |
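The earlier recommendation to deploy 3 controllers rests on simple majority arithmetic: a consensus-based cluster needs a strict quorum to commit state changes, so n nodes tolerate floor((n - 1) / 2) failures. A tiny sketch:

```python
# Quorum arithmetic behind cluster sizing recommendations.
def majority(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return (n - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} nodes: quorum={majority(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
# 3 nodes: quorum=2, survives 1 failure(s)  <- the common production choice
```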
291.8 State Synchronization
Controllers in a cluster must maintain a consistent view of network state.
291.8.1 What Needs Synchronization
- Topology: Which devices are connected, which links are active
- Flow rules: What forwarding rules are installed on each switch
- Statistics: Traffic counts, port status
- Application data: Custom state maintained by applications
291.8.2 Synchronization Mechanisms
1. Raft consensus (ONOS uses this)
- Leader election ensures one controller makes decisions
- All state changes replicated to followers
- Guarantees consistency but adds latency (~5-10ms)
2. Eventual consistency (Cassandra-style)
- Changes propagate asynchronously
- Faster but risk of temporary inconsistencies
- Acceptable for non-critical data (statistics)
3. Distributed transactions (ODL MD-SAL)
- Two-phase commit for critical operations
- Strong consistency guarantee
- Higher latency (~10-20ms)
291.8.3 Consistency vs Performance Tradeoffs
| Mechanism | Consistency | Latency | Use Case |
|---|---|---|---|
| Raft | Strong | +5-10ms | Flow rules, topology |
| Eventual | Weak | +1-2ms | Statistics, counters |
| 2PC | Strong | +10-20ms | Cross-domain operations |
291.9 Switch-Controller Connections
Switches can connect to multiple controllers for redundancy.
291.9.1 OpenFlow Auxiliary Connections
Switch configuration:
- Primary controller: 10.0.0.1:6653
- Backup controller: 10.0.0.2:6653
- Backup controller: 10.0.0.3:6653
Behavior:
- Switch connects to all three controllers
- Primary controller is "master" (can modify flow tables)
- Backup controllers are "slave" (read-only access)
- If the master fails, a backup controller claims the master role (failover sequence in the next section)
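A minimal sketch of applying this multi-controller configuration to an Open vSwitch-based device via ovs-vsctl (wrapped in Python to keep one language throughout; the bridge name br0 is an assumption):

```python
# A minimal sketch (assuming Open vSwitch) of registering all three
# controllers from the configuration above. Bridge name is a placeholder.
import subprocess

BRIDGE = "br0"  # assumption: bridge name on the switch
controllers = ["tcp:10.0.0.1:6653", "tcp:10.0.0.2:6653", "tcp:10.0.0.3:6653"]

# One ovs-vsctl call registers every controller; the switch maintains a
# connection to each and honors OpenFlow master/slave role messages.
subprocess.run(["ovs-vsctl", "set-controller", BRIDGE, *controllers],
               check=True)
# "secure" fail mode: keep existing flows if all controllers are lost,
# rather than falling back to standalone L2 learning.
subprocess.run(["ovs-vsctl", "set-fail-mode", BRIDGE, "secure"], check=True)
```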
291.9.2 Failover Behavior
1. Switch detects master failure (TCP connection drops or echo timeout)
2. Backup controller sends OFPT_ROLE_REQUEST (role = MASTER) to the switch: "I am taking over"
3. Switch confirms with OFPT_ROLE_REPLY: "You are now master"
4. Switch accepts Flow-Mod messages from the new master
5. Total time: 3-10 seconds depending on echo interval (steps 2-3 are sketched below)
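In Ryu terms, steps 2 and 3 look roughly like the following sketch: the backup controller issues an OFPRoleRequest claiming MASTER and handles the switch's OFPRoleReply. The generation_id handling is simplified; a real cluster must track and increment it so switches can reject stale claims.

```python
# A minimal sketch (Ryu, OpenFlow 1.3) of a backup controller claiming
# the master role during failover. generation_id handling is simplified.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class FailoverRole(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def claim_master(self, datapath):
        ofp = datapath.ofproto
        parser = datapath.ofproto_parser
        # Step 2: request MASTER. generation_id=0 is a placeholder; real
        # deployments increment it to guard against stale role requests.
        datapath.send_msg(
            parser.OFPRoleRequest(datapath, ofp.OFPCR_ROLE_MASTER, 0))

    @set_ev_cls(ofp_event.EventOFPRoleReply, MAIN_DISPATCHER)
    def role_reply_handler(self, ev):
        # Step 3: the switch confirms; Flow-Mods are accepted from now on.
        self.logger.info("switch confirmed role=%d", ev.msg.role)
```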
291.9.3 Controller Role Configuration
| Role | Flow-Mod | Stats | Packet-In | Use Case |
|---|---|---|---|---|
| Master | Yes | Yes | Yes | Primary controller |
| Slave | No | Yes | Optional | Backup, monitoring |
| Equal | Yes | Yes | Yes | Load balancing (rare) |
291.10 Selecting a Controller for IoT
Choosing the right controller depends on your deployment requirements.
291.10.1 Decision Matrix
| Requirement | Recommended Controller | Rationale |
|---|---|---|
| Learning/Education | Ryu | Python, simplest API, best tutorials |
| Prototype/PoC | Ryu or Floodlight | Quick setup, good performance |
| Enterprise deployment | OpenDaylight | Comprehensive features, multi-protocol |
| High availability critical | ONOS | Best clustering, carrier-grade reliability |
| High performance | ONOS or Floodlight | 1M+ and 600K flows/sec respectively |
| Large scale (10K+ devices) | ONOS | Designed for scalability |
| Mixed network (IoT + legacy) | OpenDaylight | Supports most protocols |
| Cloud-native deployment | ONOS | Microservices architecture |
| On-premises/embedded | Ryu or Floodlight | Lightweight, lower resource usage |
291.10.2 Real-World Example: AT&T Domain 2.0 Migration
Background: AT&T’s global network carries 197 petabytes daily across 135,000 route miles serving 340M+ connections. Traditional hardware-based network required 18-36 months to deploy new services.
SDN Migration (2013-2020):
- Controller Platform: ONOS (Open Network Operating System)
- Scale: 75% of network traffic virtualized by 2020
- Switches Managed: 65,000+ virtual and physical switches
- Control Plane Instances: 500+ ONOS controller clusters (3-5 nodes each)
Results:
- Service Deployment Time: 18 months -> 90 minutes (a 99.99% reduction)
- Network Efficiency: 40-60% cost savings through software-defined routing
- Reliability: 99.999% uptime (5.26 minutes downtime/year) despite centralized control
- OPEX Reduction: $2B+ annual savings through automation and dynamic optimization
Key Technical Achievements:
- Flow Rule Scale: controllers manage 500K+ flow rules per cluster with <10ms flow setup latency
- Failover Time: <2 seconds controller cluster failover with zero packet loss
- Multi-Tenancy: network slicing supports 50+ business units on shared infrastructure
- Dynamic Routing: real-time traffic engineering reroutes around congestion in <5 seconds vs. 15-30 minutes with traditional OSPF
IoT Implications: AT&T's success demonstrates that SDN scales to carrier-grade deployments with millions of endpoints. For IoT, key lessons include:
- Proactive Flow Installation: pre-install rules for known traffic patterns (sensors -> gateway) to avoid PACKET_IN overhead
- Controller Clustering: 3-5 node clusters provide high availability without sacrificing performance
- Hierarchical Control: regional controllers manage local switches and report to a centralized orchestrator
- Policy-Based Management: define high-level policies ("prioritize emergency services") rather than per-device rules
291.11 Knowledge Check
Test your understanding of SDN APIs and high availability.
291.12 Summary
Key Takeaways:
- Northbound APIs (REST/gRPC) allow applications to program the network without OpenFlow knowledge
- Southbound APIs (OpenFlow/NETCONF) control network devices with standardized protocols
- OpenFlow messages (Packet-In, Flow-Mod, Stats-Request) form the control loop between controller and switches
- Clustering provides high availability (99.99%+) but at performance cost (10-25% slower)
- State synchronization uses Raft consensus for strong consistency or eventual consistency for performance
- Switch-controller connections with multiple controllers enable sub-10s failover
Practical Guidelines:
- Use REST APIs for application integration, not direct OpenFlow manipulation
- Deploy 3-node clusters for 99.99%+ uptime requirements
- Configure switches with multiple controller connections (primary + 2 backups)
- Use proactive flow installation for known IoT traffic patterns (see the sketch after this list)
- Set appropriate flow timeouts: permanent for critical flows, idle-timeout for dynamic traffic
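To make the proactive-installation guideline concrete, here is a minimal Ryu sketch that pre-installs a sensor-to-gateway rule the moment a switch connects, so the first packet never triggers a Packet-In. The subnet and output port reuse the illustrative values from the Flow-Mod JSON earlier; both are assumptions.

```python
# A minimal sketch (Ryu, OpenFlow 1.3) of proactive flow installation:
# install a permanent sensor->gateway rule at switch connect time.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class ProactiveInstaller(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        # Known IoT pattern: sensor subnet -> gateway port (assumed port 5).
        match = parser.OFPMatch(eth_type=0x0800,
                                ipv4_src=("10.0.1.0", "255.255.255.0"))
        actions = [parser.OFPActionOutput(5)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        # Permanent rule: both timeouts zero, per the guideline above.
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100, match=match,
                                      instructions=inst,
                                      hard_timeout=0, idle_timeout=0))
```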
291.13 Visual Reference Gallery
This diagram illustrates the fundamental shift in network architecture that SDN represents, separating the control plane from the data plane.
This visualization shows the three-layer SDN architecture with applications, controller, and infrastructure, along with the APIs connecting them.
SDN enables intelligent data aggregation and traffic engineering, optimizing how IoT data flows through the network to cloud destinations.
291.14 What’s Next
Now that you understand APIs and high availability:
- SDN for IoT: Variants and Challenges: Explore IoT-specific SDN optimizations (SD-IoT, resource-constrained adaptations)
- SDN Analytics and Implementations: Learn traffic engineering, network slicing, and real deployments
- SDN Controller Basics (Overview): Return to the index for additional resources
Hands-on Practice:
- Configure ONOS 3-node cluster and test failover
- Write REST API client to query topology and install flows
- Simulate controller failure with Mininet and measure recovery time