145  SOA API Design & Discovery

In 60 Seconds

IoT APIs need URL-path versioning (e.g., /v2/devices) for breaking changes and header versioning for minor updates. Rate limit per client class (in this chapter's worked example, 2 req/sec per device up to 500 req/sec per internal service) with 429 responses and Retry-After headers. Service discovery uses either client-side (Consul/etcd lookup, lower latency) or server-side (load balancer routing, simpler clients) – choose client-side for edge devices that cache, server-side for constrained devices.

145.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Architect Resilient APIs: Implement versioning strategies, rate limiting, and backward compatibility for IoT service interfaces
  • Implement Service Discovery: Configure dynamic service registration and discovery for scalable IoT deployments
  • Select Discovery Patterns: Justify the choice between client-side and server-side discovery based on deployment requirements

An API (Application Programming Interface) is like a menu at a restaurant – it lists what services are available and how to order them. In IoT, APIs let devices and applications request data or trigger actions in a standard way. Good API design means any new device can easily plug into your system without custom coding.

145.2 Prerequisites

Before diving into this chapter, you should be familiar with:

APIs are like restaurant menus - they tell you what you can order and how to ask for it!

145.2.1 The Sensor Squad Adventure: The Menu Problem

When the Sensor Squad’s pizza restaurant got popular, they had a problem. Customers kept asking for things in different ways:

  • “I want a large pepperoni!”
  • “Give me pizza, big one, with the red meat circles!”
  • “PIZZA NOW! MEAT!”

Sunny the Order Taker got confused and made mistakes. So they created a menu (that’s an API!):

PIZZA MENU v1:

  • Pizza size: Small, Medium, Large
  • Toppings: Pepperoni, Mushroom, Cheese
  • How to order: “I’d like a [SIZE] pizza with [TOPPING]”

Now everyone knew exactly how to ask! And when they added new pizzas, they made MENU v2 that still worked with the old way of ordering.

145.2.2 Key Words for Kids

| Word | What It Means |
| --- | --- |
| API | A menu that tells computers how to ask for things |
| Version | Like “Menu v1” and “Menu v2” - newer versions with more options |
| Service Discovery | Finding which kitchen is open and ready to cook |

145.3 API Design and Versioning

APIs are the contracts between services. Poor API design creates tight coupling and painful migrations.

145.3.1 RESTful API Design for IoT

Resource-Oriented Design:

# Good: Resource-oriented
GET    /devices                    # List devices
GET    /devices/{id}               # Get specific device
POST   /devices                    # Create device
PUT    /devices/{id}               # Update device
DELETE /devices/{id}               # Delete device
GET    /devices/{id}/telemetry     # Get device telemetry
POST   /devices/{id}/commands      # Send command to device

# Bad: RPC-style
POST   /getDevices
POST   /createDevice
POST   /updateDevice
POST   /deleteDevice
POST   /getDeviceTelemetry
POST   /sendDeviceCommand

IoT-Specific Considerations:

| Concern | REST Best Practice | IoT Adaptation |
| --- | --- | --- |
| Large payloads | Pagination | Streaming for time-series |
| Real-time updates | Polling | WebSocket/SSE subscriptions |
| Batch operations | Multiple calls | Bulk endpoints |
| Binary data | Base64 encoding | Protobuf/CBOR |

145.3.2 API Versioning Strategies

Figure 145.1: Three API versioning approaches (URL path, header, and query parameter) with different trade-offs

| Strategy | Example | Pros | Cons |
| --- | --- | --- | --- |
| URL Path | /v1/devices | Clear, cacheable, easy routing | URL pollution, version in every link |
| Header | Accept: application/vnd.iot.v1+json | Clean URLs | Hidden, harder to test |
| Query Param | /devices?version=1 | Easy to add | Can break caching, looks hacky |

Recommendation for IoT: URL path versioning (/v1/, /v2/) is most practical. IoT devices often have limited HTTP customization and need simple, predictable endpoints.
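To illustrate why path versioning is easy to route, here is a minimal sketch of a prefix parser. The function name `split_version` and the `SUPPORTED_VERSIONS` set are hypothetical and not part of any particular gateway product; a real gateway would typically do this with a routing rule rather than application code.

```python
import re

# Hypothetical set of versions this server still serves.
SUPPORTED_VERSIONS = {1, 2}

_VERSION_PREFIX = re.compile(r"^/v(\d+)(/.*)?$")


def split_version(path):
    """Return (version, remaining_path) for paths like /v2/devices/42.

    Raises ValueError when the prefix is missing or the version is unknown.
    """
    match = _VERSION_PREFIX.match(path)
    if match is None:
        raise ValueError(f"no version prefix in {path!r}")
    version = int(match.group(1))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version v{version}")
    # A bare "/v1" maps to the root of that version's resource tree.
    return version, match.group(2) or "/"
```

The device only has to change one string constant in its firmware to move from /v1/ to /v2/, which is the property the recommendation relies on.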

145.3.3 Backward Compatibility Rules

Safe Changes (Non-Breaking):

  • Adding new optional fields
  • Adding new endpoints
  • Adding new optional query parameters
  • Making required fields optional

Breaking Changes (Require New Version):

  • Removing fields
  • Renaming fields
  • Changing field types
  • Changing URL structure
  • Making optional fields required
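The rules above can be turned into a rough automated check. The sketch below assumes a deliberately simple schema representation, a dict mapping each field name to a (type_name, required) pair, which is an illustration rather than a real schema format. It also treats a newly added required field as breaking, which follows from the rules for request payloads but is not listed explicitly above.

```python
def breaking_changes(old, new):
    """Compare two schemas of the form {field: (type_name, required)}.

    Returns a list of human-readable breaking changes (empty if none).
    """
    problems = []
    for name, (old_type, old_required) in old.items():
        if name not in new:
            # A rename looks the same as a removal from the client's side.
            problems.append(f"removed or renamed field: {name}")
            continue
        new_type, new_required = new[name]
        if new_type != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new_type}")
        if new_required and not old_required:
            problems.append(f"optional field made required: {name}")
    for name, (_, required) in new.items():
        if name not in old and required:
            # Assumption: schemas describe request payloads.
            problems.append(f"new required field: {name}")
    return problems
```

Running a check like this in CI against the previous release's schema is one way to catch an accidental breaking change before it reaches deployed devices.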

145.4 Service Discovery

In dynamic environments, services need to find each other without hardcoded addresses.

145.4.1 The Service Discovery Problem

Figure 145.2: Service discovery flow: service registration, discovery lookup, and invocation

145.4.2 Discovery Patterns

1. Client-Side Discovery

The client queries the registry and chooses an instance:

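# Client-side discovery example (python-consul client)
import random

import consul

def get_telemetry_service():
    c = consul.Consul()  # defaults to the local agent at 127.0.0.1:8500
    # catalog.service() returns (index, nodes); keep only the node list
    services = c.catalog.service('telemetry-service')[1]
    if not services:
        raise RuntimeError("no telemetry-service instances registered")
    # Client chooses instance (round-robin, random, etc.)
    instance = random.choice(services)
    # Prefer the service-level address; fall back to the node address
    host = instance.get('ServiceAddress') or instance['Address']
    return f"http://{host}:{instance['ServicePort']}"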
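Edge devices usually cannot afford a registry query on every request, so client-side discovery is normally paired with a local cache. The following is a minimal sketch of that idea, not part of any Consul client library: `CachedResolver`, its `lookup` callable, and the 30-second TTL are all assumptions made for illustration. `lookup` stands in for any registry query that returns a list of (host, port) tuples.

```python
import itertools
import time


class CachedResolver:
    """Client-side discovery with a TTL cache and round-robin selection."""

    def __init__(self, lookup, ttl_seconds=30.0, clock=time.monotonic):
        self._lookup = lookup          # callable returning [(host, port), ...]
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for testing
        self._expires_at = float("-inf")
        self._cycle = None

    def next_url(self):
        now = self._clock()
        if self._cycle is None or now >= self._expires_at:
            # Cache is empty or stale: ask the registry again.
            instances = self._lookup()
            if not instances:
                raise RuntimeError("registry returned no instances")
            self._cycle = itertools.cycle(instances)
            self._expires_at = now + self._ttl
        host, port = next(self._cycle)
        return f"http://{host}:{port}"
```

The trade-off is staleness: with a 30-second TTL a client may keep calling an instance for up to 30 seconds after it was deregistered, so callers still need retry logic.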

2. Server-Side Discovery

A load balancer handles discovery:

Figure 145.3: Server-side discovery: the load balancer abstracts service locations from client applications

145.4.3 Service Registry Options

| Registry | Protocol | Strengths | IoT Use Case |
| --- | --- | --- | --- |
| Consul | HTTP/DNS | Health checks, KV store | Multi-datacenter IoT |
| etcd | gRPC | Strong consistency | Kubernetes-native |
| Eureka | HTTP | Netflix ecosystem | Spring Boot IoT |
| Kubernetes DNS | DNS | Built into K8s | Container-native |

145.5 Worked Example: Designing an API for a Fleet Management Platform

Scenario: A logistics company operates 2,000 delivery trucks with IoT telematics devices (GPS, fuel sensor, engine diagnostics, door sensor). The devices run on cellular connections with varying bandwidth (3G/4G) and must report to a central platform. The engineering team needs to design an API that handles real-time tracking, historical queries, and remote commands.

Step 1: Define Resource Model

Map physical entities to REST resources:

/v1/vehicles                          # 2,000 trucks
/v1/vehicles/{vin}/telemetry          # Real-time location, speed, fuel
/v1/vehicles/{vin}/telemetry/history  # Historical data with time range
/v1/vehicles/{vin}/diagnostics        # OBD-II fault codes
/v1/vehicles/{vin}/commands           # Remote lock/unlock, engine cutoff
/v1/geofences                         # Virtual boundaries for alerts
/v1/alerts                            # Active alerts (speeding, geofence breach)

Step 2: Handle Bandwidth Constraints

Delivery trucks on 3G connections cannot afford large payloads. Design compact responses:

| Endpoint | Full Response | Compact Response (3G) | Savings |
| --- | --- | --- | --- |
| GET /vehicles/{vin}/telemetry | 2.1 KB JSON | 340 bytes CBOR | 84% |
| POST /vehicles/{vin}/telemetry (batch 60 readings) | 126 KB JSON | 18 KB Protobuf | 86% |
| GET /vehicles (list 2,000) | 480 KB | 52 KB (pagination, 50/page) | 89% |

Implementation: Accept header negotiation (Accept: application/cbor for constrained devices, Accept: application/json for dashboards).
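The negotiation step can be sketched in a few lines. This is a simplified illustration, not a complete implementation of HTTP content negotiation: the function name `negotiate` is hypothetical, the server is assumed to support only the two media types named above, and type wildcards such as application/* are not handled.

```python
SUPPORTED = ("application/cbor", "application/json")
DEFAULT = "application/json"


def negotiate(accept_header):
    """Pick a response media type from an Accept header.

    Handles q-values and */*. Returns None when the client accepts
    nothing the server supports (the HTTP 406 case).
    """
    if not accept_header:
        return DEFAULT
    best, best_q = None, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0        # malformed weight: treat as not acceptable
        if media_type == "*/*":
            media_type = DEFAULT
        if media_type in SUPPORTED and q > best_q:
            best, best_q = media_type, q
    return best
```

A telematics device that sends Accept: application/cbor gets the compact encoding, while a browser dashboard sending a typical Accept header with */* falls back to JSON.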

Step 3: Version the Critical Breaking Change

After 6 months, the team needs to change the telemetry schema: the location field must change from a flat {lat, lng} to GeoJSON {type: "Point", coordinates: [lng, lat]} for geofence compatibility. This is a breaking change (field restructured, coordinate order reversed).

Migration plan:

  • Week 1: Deploy /v2/vehicles/{vin}/telemetry alongside /v1/
  • Weeks 2-8: Push firmware updates to trucks (OTA, roughly 290 trucks/week to reach all 2,000 in 7 weeks)
  • Week 9: Add a Sunset: Mon, 01 Jun 2026 00:00:00 GMT header to /v1/ responses
  • Week 12: Monitor /v1/ traffic; 180 trucks are still on v1 (OTA failures)
  • Week 14: Force-update the remaining 180 trucks during depot visits
  • Week 16: Decommission the /v1/ telemetry endpoint after confirming zero active clients
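The schema change in Step 3 can be sketched as a pair of conversion functions, which is also roughly what a v1-to-v2 translation layer would need while both versions are live. The function names are hypothetical; the field names come from the scenario. Note the coordinate order: GeoJSON puts longitude first.

```python
def location_v1_to_v2(v1):
    """Convert flat {lat, lng} to a GeoJSON Point (longitude first)."""
    return {"type": "Point", "coordinates": [v1["lng"], v1["lat"]]}


def location_v2_to_v1(v2):
    """Convert a GeoJSON Point back to flat {lat, lng} for v1 clients."""
    if v2.get("type") != "Point":
        raise ValueError("only a GeoJSON Point can be expressed in v1")
    lng, lat = v2["coordinates"][:2]
    return {"lat": lat, "lng": lng}
```

Because the conversion is lossless in both directions for point locations, the backend can store a single representation and serve either version from it.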

Step 4: Configure Service Discovery for Multi-Region

The platform runs in 3 AWS regions (us-east-1, eu-west-1, ap-southeast-1) to minimize latency for trucks in North America, Europe, and Southeast Asia.

Consul configuration for geo-aware routing:

  • Each region runs a Consul datacenter with local service instances
  • Trucks perform a DNS lookup: telemetry.service.consul resolves to the nearest regional endpoint
  • Health checks verify API response time < 200 ms; unhealthy instances are deregistered within 30 seconds
  • If an entire region fails, Consul’s prepared queries route traffic to the next-nearest region (failover latency: 50-150 ms additional)

Step 5: Rate Limiting to Protect the Platform

A firmware bug caused 300 trucks to retry failed GPS uploads every 100 ms (instead of every 60 seconds), generating 3,000 requests/second, roughly 90x the fleet's normal rate of about 33 requests/second (2,000 trucks reporting every 60 seconds).

Rate limiting configuration that prevented outage:

| Client Type | Rate Limit | Burst | Response on Exceed |
| --- | --- | --- | --- |
| Telematics device | 2 req/sec sustained | 10 req burst | 429 + Retry-After: 30 |
| Dashboard user | 20 req/sec | 50 req burst | 429 + Retry-After: 5 |
| Internal service | 500 req/sec | 1,000 req burst | 429 + Retry-After: 1 |

Result: The 300 buggy trucks were throttled to 600 total req/sec (2 each), protecting the remaining 1,700 trucks and all dashboard users. Without rate limiting, the 3,000 req/sec burst would have saturated the API gateway, causing timeouts for all 2,000 vehicles.
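A sustained rate combined with a burst allowance is commonly implemented as a token bucket. The sketch below is one possible implementation, not the platform's actual gateway code; the class name and the injectable `clock` parameter are assumptions for illustration. It returns the minimum wait as Retry-After, whereas the table above uses fixed, longer values as a policy choice to make misbehaving devices back off harder.

```python
import math
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec, stores at most `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self._clock = clock            # injectable for testing
        self._tokens = float(burst)    # start full so a fresh client can burst
        self._last = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond the burst size.
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True, 0
        # Seconds until one whole token exists, rounded up for Retry-After.
        wait = (1.0 - self._tokens) / self.rate
        return False, math.ceil(wait)
```

With rate=2 and burst=10 (the telematics device row), a truck can send 10 requests back to back, after which it is held to 2 per second; a rejected request would be answered with 429 and the Retry-After value.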

Outcome: The API serves 2,000 trucks across 3 regions with 99.95% uptime over 12 months. The v1-to-v2 migration completed in 16 weeks with zero data loss. Average API response time is 45 ms (same-region) and 180 ms (cross-region failover).

Rate limiting prevents API overload during traffic spikes. For 2,000 telematics devices polling every 60 seconds, calculate capacity:

\[\text{Base load} = \frac{N_{devices}}{T_{interval}} = \frac{2000}{60} \approx 33.3 \text{ req/s}\]

Burst calculation: If all devices retry simultaneously after a 5-minute outage:

\[\text{Burst} = N_{devices} = 2000 \text{ req/s (thundering herd)}\]

Rate limit design: Per-device limit of 2 req/sec prevents runaway retries:

\[\text{Capped burst} = N_{devices} \times 2 = 4000 \text{ req/s (manageable)}\]

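A single synchronized retry (2,000 req/s) already consumes 40% of an API gateway rated for 5,000 req/s, leaving only 60% of capacity for normal traffic. Without rate limiting, devices that keep retrying (for example every 100 ms, i.e. 10 req/s each) could push the load to 20,000 req/s and saturate the gateway. A limit of 2 req/s per device caps the worst case at 4,000 req/s, preserving 20% headroom.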

Scenario: An IoT platform serves 50,000 smart home devices with an API gateway. Calculate appropriate rate limits to protect against accidental DDoS while allowing normal operation.

Given:

  • 50,000 devices: 40,000 sensors (periodic telemetry), 10,000 actuators (command/response)
  • Normal traffic: sensors report every 60s, actuators respond to commands (avg 10/hour per device)
  • API gateway capacity: 10,000 req/sec sustained, 20,000 req/sec burst (30s)
  • Firmware bug risk: retry storm (device retries every 100ms instead of 60s)

Step 1: Calculate Normal Load

  • Sensor load: 40,000 devices / 60s = 667 req/sec
  • Actuator load: 10,000 devices × 10 cmd/hr / 3600s = 28 req/sec
  • Total normal: 695 req/sec (7% of capacity) ✓ Healthy

Step 2: Estimate Burst Traffic (Peak Hour)

  • Peak is 3x normal (smart home morning routine)
  • Peak load: 695 × 3 = 2,085 req/sec (21% of capacity) ✓ Still comfortable

Step 3: Calculate Buggy Firmware Impact (No Rate Limiting)

  • If 1% of devices (500) have buggy firmware retrying every 100ms:
  • Buggy device load: 500 × (1000ms / 100ms) = 5,000 req/sec
  • Combined with normal: 5,000 + 695 = 5,695 req/sec (57% capacity)
  • If 5% buggy (2,500 devices): 2,500 × 10 + 695 = 25,695 req/sec → gateway saturated and crashes

Step 4: Design Per-Device Rate Limit

  • Normal max: sensors report every 60s, so even 1 req/sec is generous; 2 req/sec sustained leaves room for retries
  • Burst need: about 5 requests (connection setup, retries, multiple readings); allow 10 for margin
  • Per-device limit: 2 req/sec sustained, 10 req burst

Step 5: Validate Rate Limit Effectiveness

  • 500 buggy devices capped at 2 req/sec each = 1,000 req/sec (not 5,000)
  • Combined with normal: 1,000 + 695 = 1,695 req/sec (17% capacity) ✓ Protected
  • 2,500 buggy devices capped: 5,000 req/sec + 695 = 5,695 req/sec (57%) ✓ Gateway survives
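The arithmetic in Steps 1-5 can be reproduced with a small helper, which makes it easy to try other buggy-device percentages. The function name and parameter names are hypothetical; the default values are the ones given in the scenario. Like the worked example, it counts normal traffic for the full fleet, which slightly overestimates the total.

```python
def gateway_load(buggy_devices, per_device_cap=None,
                 sensors=40_000, actuators=10_000,
                 sensor_interval_s=60, commands_per_hour=10,
                 retry_interval_s=0.1):
    """Estimated gateway load in req/sec: normal traffic plus a retry storm."""
    normal = sensors / sensor_interval_s + actuators * commands_per_hour / 3600
    storm_rate = 1 / retry_interval_s          # req/sec per buggy device
    if per_device_cap is not None:
        # The rate limiter rejects anything above the cap.
        storm_rate = min(storm_rate, per_device_cap)
    return normal + buggy_devices * storm_rate


CAPACITY = 10_000  # sustained req/sec, from the scenario
for buggy in (500, 2_500):
    uncapped = gateway_load(buggy)
    capped = gateway_load(buggy, per_device_cap=2)
    print(f"{buggy} buggy: {uncapped / CAPACITY:.0%} uncapped, "
          f"{capped / CAPACITY:.0%} capped")
```

The results differ from the hand calculation by less than 1 req/sec because the worked example rounds the sensor and actuator loads before adding them.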

Step 6: Design Global Rate Limit (Defense in Depth)

  • Per-tenant (building) limit: 200 devices/building × 2 req/sec = 400 req/sec per tenant
  • Global limit: 8,000 req/sec (80% of capacity, reserve 20% for burst)

Configuration:

rate_limits:
  per_device:
    sustained: 2    # req/sec
    burst: 10       # max requests accepted in one burst (bucket size)
    action: "429_with_retry_after"
  per_tenant:
    sustained: 400  # req/sec
    burst: 1000     # max requests accepted in one burst (bucket size)
    action: "429_with_retry_after"
  global:
    sustained: 8000  # req/sec (80% capacity)
    burst: 15000     # allow 15k/sec for 30s
    action: "503_service_unavailable"
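The layering in this configuration can be sketched as a check that evaluates the narrowest scope first. This is a simplified illustration under stated assumptions: it uses fixed one-second windows and only the sustained limits, ignoring the burst settings, and the class name `LayeredLimiter` is hypothetical. A production gateway would more likely use token buckets or sliding windows per scope.

```python
from collections import defaultdict

LIMITS = {"device": 2, "tenant": 400, "global": 8000}   # sustained req/sec


class LayeredLimiter:
    """Fixed one-second-window counters for device, tenant, and global scope."""

    def __init__(self, limits=LIMITS):
        self.limits = limits
        self._window = None
        self._counts = defaultdict(int)

    def check(self, device_id, tenant_id, now):
        """Return the HTTP status for one request arriving at time `now`."""
        window = int(now)
        if window != self._window:             # new second: reset all counters
            self._window = window
            self._counts.clear()
        keys = [("device", device_id), ("tenant", tenant_id), ("global", None)]
        for scope, ident in keys:
            if self._counts[(scope, ident)] >= self.limits[scope]:
                # Client-specific limits return 429; the global limit means
                # the platform itself is overloaded, hence 503.
                return 503 if scope == "global" else 429
        for scope, ident in keys:              # count only accepted requests
            self._counts[(scope, ident)] += 1
        return 200
```

Checking the device limit first means a single misbehaving device is rejected before it can consume its tenant's or the platform's budget.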

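Result: With per-device limits, the gateway survives scenarios that would otherwise crash it. At 5% buggy devices (2,500), load stays at 57% of capacity and healthy devices are unaffected. At 10% (5,000 devices), the capped load is 5,000 × 2 + 695 = 10,695 req/sec instead of roughly 50,000 req/sec; that exceeds the 8,000 req/sec global limit, so the global limit sheds the excess and the gateway degrades gracefully rather than crashing.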

Choose the right versioning approach based on client capabilities and deployment constraints.

| Criterion | URL Path (/v2/devices) | Header (Accept: v2) | Query Param (?v=2) |
| --- | --- | --- | --- |
| Cacheability | ✓ Excellent (URL is cache key) | ✗ Poor (headers often ignored by proxies) | ⚠ Moderate (query params cached separately) |
| Client Simplicity | ✓ Trivial (just change URL) | ✗ Complex (must set custom header) | ✓ Simple (append param) |
| API Gateway Routing | ✓ Trivial (route by path prefix) | ⚠ Moderate (inspect headers) | ⚠ Moderate (inspect query string) |
| Documentation Clarity | ✓ Visible in URL | ✗ Hidden (must read docs) | ⚠ Visible but looks hacky |
| URL Pollution | ✗ Every endpoint has /v1/, /v2/ prefix | ✓ Clean URLs | ✗ Every URL has ?v=N suffix |
| IoT Device Support | ✓ Works on all HTTP libraries | ✗ Some IoT devices can’t customize headers | ✓ Works on all HTTP libraries |

Recommendation by Use Case:

| Use Case | Best Choice | Why |
| --- | --- | --- |
| IoT devices (ESP32, Arduino) | URL Path | Simple HTTP libraries often can’t customize headers |
| Mobile apps (iOS/Android) | URL Path or Header | Both work; URL path easier to cache |
| Web dashboards (SPA) | Header | Keeps URLs clean for browser history |
| Internal microservices | Header | Reduces URL clutter in service mesh |
| Public API for third-party developers | URL Path | Most intuitive for documentation and testing |

Mixing Strategies (Discouraged): Some APIs accept both (e.g., /v2/devices OR /devices with Accept: v2). This adds complexity and enlarges the testing surface. Choose ONE strategy and stick with it.

Common Mistake: No Sunset Date for Deprecated API Versions

The Problem: An IoT platform maintained 5 API versions (v1 through v5) simultaneously for 3 years, supporting edge cases from 50 devices still on v1. The cost: maintaining authentication, database schemas, and test suites for all 5 versions consumed 40% of engineering time, blocking new features.

Why It Happens:

  • Fear of breaking existing devices (“we can’t force customers to upgrade”)
  • No sunset policy defined upfront
  • No telemetry on API version usage (didn’t know v1 had only 50 users)

The Solution: Sunset Policy with Gradual Migration

Year 1: Launch v2 with Sunset Header

HTTP/1.1 200 OK
Sunset: Tue, 31 Dec 2024 23:59:59 GMT
Deprecation: true
Link: <https://api.example.com/docs/migration-v1-to-v2>; rel="deprecation"

The Sunset header tells clients v1 will shut down in 12 months. Well-behaved clients log warnings and notify operators.
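On the client side, "logging a warning" starts with parsing the header. The sketch below uses only the Python standard library; the function name `days_until_sunset` is hypothetical, and `headers` is assumed to be a plain dict with the exact key "Sunset" (real HTTP libraries usually provide case-insensitive header access).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def days_until_sunset(headers, now=None):
    """Return whole days until the Sunset date, or None if no header is set.

    A negative result means the sunset date has already passed.
    """
    value = headers.get("Sunset")
    if value is None:
        return None
    sunset = parsedate_to_datetime(value)     # parses HTTP-date format
    if sunset.tzinfo is None:                 # treat naive dates as UTC
        sunset = sunset.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (sunset - now).days
```

A device or integration could call this on every response and raise an operator alert once the result drops below some threshold, for example 90 days.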

Months 1-3: Monitor and Notify

-- Track v1 usage daily
SELECT device_id, COUNT(*) AS v1_requests
FROM api_logs
WHERE api_version = 'v1' AND date > NOW() - INTERVAL '7 days'
GROUP BY device_id
ORDER BY v1_requests DESC;

Send email to customers: “Your devices X, Y, Z are using deprecated v1. Migrate by Dec 31.”

Months 4-9: Active Migration

  • Offer free firmware updates for v1 devices
  • Provide migration scripts for custom integrations
  • Track migration progress (50 devices → 25 → 10 → 5)

Month 10: Final Warning + Whitelist Exception

  • 5 devices remain on v1 (e.g., embedded industrial hardware, can’t update firmware)
  • Contact customers: “v1 shutting down in 60 days. Upgrade or request exception.”
  • 3 customers upgrade, 2 request exception (critical infrastructure, planned replacement in 2025)

Month 12: Sunset with Exception

  • v1 shut down for all devices except whitelisted 2
  • Whitelisted devices routed to compatibility shim (v1 API → v2 backend with translation layer)
  • Shim has 1/10th the maintenance cost of full v1 support

Months 13+: Enforce

  • v1 requests return 410 Gone (permanent shutdown, not temporary 503):

HTTP/1.1 410 Gone
Content-Type: application/json

{
  "error": "API v1 was sunset on 2024-12-31",
  "migration_guide": "https://api.example.com/docs/migration-v1-to-v2",
  "support_email": "api-support@example.com"
}
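The Month 12 and Months 13+ behavior can be sketched as a single routing decision. This is an illustration only: the function name, the backend labels, and the device IDs in `V1_EXCEPTIONS` are hypothetical placeholders, while the sunset date and the error body follow the example above.

```python
from datetime import datetime, timezone

V1_SUNSET = datetime(2024, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
V1_EXCEPTIONS = {"device-a", "device-b"}      # hypothetical whitelisted IDs


def route_v1_request(device_id, now):
    """Return (status, target_or_body) for a request arriving on the v1 API."""
    if now <= V1_SUNSET:
        return 200, "v1-backend"              # v1 still fully supported
    if device_id in V1_EXCEPTIONS:
        return 200, "v1-to-v2-shim"           # translated onto the v2 backend
    body = {
        "error": "API v1 was sunset on 2024-12-31",
        "migration_guide": "https://api.example.com/docs/migration-v1-to-v2",
    }
    return 410, body
```

Keeping the exception list explicit and small makes it easy to see when the shim itself can be retired.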

Result: Engineering time for API maintenance dropped from 40% to 8% (only v4 and v5 actively maintained). The 2 exception devices supported via lightweight shim until hardware replacement in 2025. Total cost of 3-year sunset: 80 hours engineering time (mostly communication and migration scripting).

Key Lesson: Define sunset policy BEFORE launching new versions. Typical IoT API lifespan: 2 years before deprecation, 1 year grace period, then sunset. Maintain only N and N-1 versions actively.

Common Pitfalls

IoT devices in the field may run for 5-10 years without firmware updates. Breaking API changes (renaming fields, changing response structures) will brick old devices. Always version APIs from day one (URL path versioning: /api/v1/), maintain at least two versions simultaneously, and use Sunset headers to give devices advance notice before deprecation.

Key Concepts
  • REST API: A Representational State Transfer API using HTTP verbs (GET, POST, PUT, DELETE) on resource-oriented URLs, the dominant pattern for IoT device management and data retrieval interfaces
  • API Versioning: The practice of including version identifiers (v1, v2) in API URLs or headers to allow breaking changes without disrupting existing clients – critical for IoT where devices may not update firmware quickly
  • Rate Limiting: An API protection mechanism that caps request rates per client (e.g., 100 requests/minute per device), preventing individual devices or rogue clients from overwhelming the IoT backend
  • Service Discovery: The mechanism by which IoT services dynamically locate each other’s network addresses without hardcoded configuration – implemented via DNS-SD, Consul, or Kubernetes service registries
  • API Gateway: A reverse proxy that centralizes cross-cutting concerns (authentication, rate limiting, SSL termination, routing) for multiple backend IoT services, providing a single entry point for devices and applications
  • OpenAPI Specification: A machine-readable YAML/JSON contract describing an API’s endpoints, parameters, and schemas – enables automatic client SDK generation and documentation for IoT device firmware teams
  • gRPC: A high-performance RPC framework using Protocol Buffers for serialization, offering 5-10x smaller payloads than JSON REST – preferred for high-frequency device-to-cloud communication in resource-constrained IoT
  • Idempotent Operation: An API operation that produces the same result regardless of how many times it is called – essential for IoT where network retries may cause duplicate requests, preventing double-execution of actuator commands
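One common way to make a non-idempotent command safe to retry is an idempotency key supplied by the client. The sketch below is a minimal in-memory illustration; the class name `CommandHandler` and the `execute` callable are assumptions, and a real service would persist the keys with an expiry rather than keep them in a dict.

```python
class CommandHandler:
    """Executes each actuator command at most once per idempotency key."""

    def __init__(self, execute):
        self._execute = execute     # callable that performs the side effect
        self._results = {}          # idempotency key -> stored result

    def handle(self, idempotency_key, command):
        if idempotency_key in self._results:
            # Retry of a request already processed: replay the stored
            # result without triggering the actuator again.
            return self._results[idempotency_key]
        result = self._execute(command)
        self._results[idempotency_key] = result
        return result
```

If a device sends "unlock" and the response is lost on a flaky cellular link, the retry carries the same key and the door is not actuated twice.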

An API where a dashboard must make 50 individual device-state calls to render one page creates 50 sequential HTTP connections, multiplying latency. Design aggregate endpoints (/api/v1/fleet/status) that return bulk data in one call. For IoT devices sending frequent sensor readings, batch endpoints (POST /readings with arrays) reduce per-message overhead by 10-100x.

IoT APIs exposed without authentication or rate limiting during development often remain insecure in production due to ‘we’ll add it later’ deferral. Implement API keys or OAuth2 client credentials and rate limiting from the first commit. An unprotected IoT API endpoint discovered by a scanner can generate millions of requests per hour, causing service outages.

145.6 Summary

This chapter covered API design and service discovery for IoT platforms:

  • RESTful API Design: Use resource-oriented design with consistent naming and HTTP methods
  • API Versioning: URL path versioning (/v1/, /v2/) is most practical for IoT with limited HTTP customization
  • Backward Compatibility: Add optional fields freely; remove or rename fields only in a new major version
  • Service Discovery: Client-side (Consul) for multi-region, server-side (K8s DNS) for single-cluster

Key Takeaway

In one sentence: Well-designed APIs with clear versioning strategies and dynamic service discovery enable IoT platforms to evolve without breaking deployed devices.

Remember this rule: Adding new optional fields is always backward compatible; removing or renaming fields requires a new API version.

145.7 Knowledge Check

Challenge: Design an API versioning strategy for a fleet management platform with these constraints:

Context:

  • 2,000 delivery trucks with embedded telematics devices (firmware update cycle: 6 months)
  • 5 third-party logistics apps consuming your API
  • Regulatory requirement: 7-year data retention
  • Team: 12 developers across 3 services

Scenario: You need to make a breaking change to the /vehicles/{vin}/telemetry endpoint:

  • Current: Flat {lat, lng} coordinates
  • New: GeoJSON {type: "Point", coordinates: [lng, lat]}
  • Reason: Geofence compatibility with industry standards

Tasks:

  1. Choose versioning strategy - URL path vs Header vs Query param?
  2. Design migration timeline - How long to support both v1 and v2?
  3. Calculate impact - How many devices must update firmware?
  4. Create sunset policy - What warnings/timelines for deprecation?
  5. Handle exceptions - What if 50 devices cannot update (embedded hardware)?

What to observe:

  • Does URL path versioning simplify client implementation for constrained devices?
  • Is 6 months enough migration time given firmware update cycles?
  • How do you handle the 50 un-updatable devices - gateway translation layer?

Deliverable: API migration plan with version lifecycle, sunset headers, client notification strategy.

145.9 What’s Next

| If you want to… | Read this |
| --- | --- |
| Build fault-tolerant IoT services with circuit breakers and bulkheads | SOA Resilience Patterns |
| Deploy and scale IoT APIs using containers and Kubernetes | SOA Container Orchestration |
| Understand SOA and microservices service boundary design | SOA and Microservices Fundamentals |
| Apply architecture patterns using IoT reference models | IoT Reference Models and Patterns |
| Learn MQTT as the primary IoT device communication protocol | MQTT Fundamentals |