44  Data Formats Practice and Assessment

44.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Solve format selection scenarios: Apply decision frameworks to real-world IoT cases
  • Calculate payload sizes: Design efficient binary encodings for specific sensor data
  • Debug encoding issues: Identify and fix common serialization problems
  • Design migration strategies: Plan format upgrades without service disruption
  • Justify recommendations: Defend format choices with quantitative analysis

This is part of a series on IoT Data Formats:

  1. IoT Data Formats Overview - Introduction and text formats
  2. Binary Data Formats - CBOR, Protobuf, custom binary
  3. Data Format Selection - Decision guides and real-world examples
  4. Data Formats Practice (this chapter) - Scenarios, quizzes, worked examples

44.2 Prerequisites

Before starting this chapter, you should have completed the earlier chapters in this series:

  • IoT Data Formats Overview - Introduction and text formats
  • Binary Data Formats - CBOR, Protobuf, custom binary
  • Data Format Selection - Decision guides and real-world examples

44.3 Knowledge Check Scenarios

Test your understanding of IoT data formats with these scenario-based questions.

Situation: You’re deploying 10,000 smart electricity meters across a city using NB-IoT cellular connectivity. Each meter sends readings every 15 minutes, including:

  • Current power consumption (float)
  • Voltage (float)
  • Meter ID (8-character string)
  • Timestamp (Unix epoch)

The cellular plan costs $0.02 per MB. You’re evaluating JSON vs CBOR vs Protocol Buffers.

Question: Which format would you choose, and why? Calculate the annual data cost difference between JSON and your recommended format.

Question: Using the size estimates given, what is the approximate annual savings of Protobuf compared to JSON?

Explanation: JSON costs $526/year and Protobuf $140/year, so the annual savings are $526 - $140 = $386.

Recommended: Protocol Buffers (or CBOR)

Calculation:

  • Messages per meter per year: 4/hour x 24 hours x 365 days = 35,040 messages
  • Total messages: 35,040 x 10,000 meters = 350.4 million messages/year

Format size estimates for this payload:

  • JSON: ~75 bytes (field names, quotes, braces)
  • CBOR: ~40 bytes (binary encoding, field names still present)
  • Protobuf: ~20 bytes (field numbers instead of names)

Annual data costs:

  • JSON: 350.4M x 75 bytes = 26.3 GB/year = $526/year
  • CBOR: 350.4M x 40 bytes = 14.0 GB/year = $280/year
  • Protobuf: 350.4M x 20 bytes = 7.0 GB/year = $140/year

Savings with Protobuf vs JSON: $386/year (73% reduction)
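
These numbers can be reproduced with a short script (a sketch; the payload sizes are the estimates above, not measured values):

METERS = 10_000
MSGS_PER_YEAR_PER_METER = 4 * 24 * 365           # one reading every 15 minutes
TOTAL_MSGS = METERS * MSGS_PER_YEAR_PER_METER    # 350.4 million messages/year
COST_PER_MB = 0.02

for fmt, size_bytes in {'JSON': 75, 'CBOR': 40, 'Protobuf': 20}.items():
    mb_per_year = TOTAL_MSGS * size_bytes / 1_000_000
    print(f'{fmt:9s} {mb_per_year:>8,.0f} MB/year  ${mb_per_year * COST_PER_MB:,.0f}/year')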

Why Protobuf over CBOR? At this scale (10,000 devices), the additional upfront schema definition cost is justified by:

  1. Stronger typing prevents data corruption bugs
  2. Efficient code generation for both meters and cloud
  3. Schema evolution handled gracefully
  4. Industry standard for high-volume IoT pipelines

Situation: You’re adding a new soil nutrient sensor to an existing LoRaWAN deployment. The sensor measures:

  • pH level (0-14, one decimal precision)
  • Nitrogen content (mg/kg, integer 0-1000)
  • Phosphorus content (mg/kg, integer 0-1000)
  • Potassium content (mg/kg, integer 0-1000)
  • Soil temperature (C, one decimal, -20 to 60)

The LoRaWAN gateway is at the edge of its range (SF12), limiting payload to 51 bytes maximum. Readings are sent every 4 hours.

Question: Can you use JSON for this sensor? If not, design a custom binary format that fits within the 51-byte limit.

Question: In the proposed binary layout, how many bytes are needed for the five measurements (excluding device ID/timestamp)?

Explanation: 9 bytes: pH (1) + N (2) + P (2) + K (2) + temperature (2).

JSON is risky and not recommended here.

Sample JSON: {"pH":7.2,"N":450,"P":320,"K":280,"temp":18.5} = 46 bytes

This barely fits, but:

  1. No room for error (one more decimal place breaks it)
  2. No room for device ID or timestamp
  3. JSON parsing on constrained MCU is expensive

Custom Binary Format (9 bytes total):

Field        Encoding            Bytes  Range                          Notes
pH           uint8 (value x 10)  1      0-140 -> 0.0-14.0              Divide by 10 to decode
Nitrogen     uint16              2      0-1000                         Direct value
Phosphorus   uint16              2      0-1000                         Direct value
Potassium    uint16              2      0-1000                         Direct value
Temperature  int16 (C x 10)      2      -200 to 600 -> -20.0 to 60.0C  Divide by 10 to decode
Total                            9                                     ~80% smaller than JSON
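
A minimal sketch of this layout in Python (big-endian, no padding; the helper names are illustrative):

import struct

SOIL_FMT = '>B3Hh'   # uint8 pH*10, three uint16 N/P/K, int16 temp*10 = 9 bytes

def encode_soil(ph, n, p, k, temp_c):
    return struct.pack(SOIL_FMT, round(ph * 10), n, p, k, round(temp_c * 10))

def decode_soil(payload):
    ph10, n, p, k, t10 = struct.unpack(SOIL_FMT, payload)
    return ph10 / 10, n, p, k, t10 / 10

payload = encode_soil(7.2, 450, 320, 280, 18.5)
assert len(payload) == 9
print(decode_soil(payload))   # (7.2, 450, 320, 280, 18.5)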

Benefits:

  • 9 bytes vs 46 bytes = ~80% bandwidth savings
  • 42 bytes remaining for device ID, timestamp, battery voltage
  • Lower transmission energy = longer battery life
  • Simpler parsing on constrained MCU

Alternative: CBOR with single-character keys and scaled integers would be roughly 20-25 bytes, which also fits and provides more flexibility than custom binary.

Situation: Your company has 50,000 deployed sensors sending JSON over MQTT. Management wants to reduce cellular data costs by 50%. The sensors have limited flash memory (32KB available for firmware updates) and are deployed in remote locations (firmware updates are expensive to verify).

Question: What format would you migrate to, and what is your migration strategy to avoid bricking devices or losing data during the transition?

Question: Which format is the best migration target if you want a smaller payload than JSON, minimal schema coordination, and a small library footprint?

Explanation: CBOR. It keeps a JSON-like data model but encodes it compactly, and the cloud can decode it without strict schema synchronization during rollout.

Recommended Format: CBOR

Why CBOR over Protobuf for this migration:

  1. Same data model as JSON - minimal code changes required
  2. Self-describing - cloud can decode without schema synchronization
  3. Smaller library footprint - fits in 32KB flash constraint
  4. No breaking changes - can decode both JSON and CBOR during transition

Migration Strategy (Zero-Downtime):

Phase 1: Cloud Preparation (Week 1)

  • Update cloud ingest to accept both JSON and CBOR
  • Add content-type header detection (application/json vs application/cbor)
  • Validate CBOR decoding matches JSON semantics
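
A sketch of the dual-format decode step, assuming the Python cbor2 package on the ingest side:

import json
import cbor2  # third-party CBOR decoder (assumed library choice)

def decode_message(content_type, body):
    if content_type == "application/cbor":
        return cbor2.loads(body)
    if content_type == "application/json":
        return json.loads(body)
    raise ValueError("unsupported content type: " + content_type)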

Phase 2: Gradual Firmware Rollout (Weeks 2-8)

  • Push firmware update to 1% of devices (500 sensors)
  • Monitor for decode errors, data quality issues
  • If successful, expand to 10%, 50%, 100%
  • New firmware sends CBOR but can fall back to JSON on a decode-error response

Phase 3: Transition Monitoring (Weeks 8-12)

  • Track ratio of JSON vs CBOR messages
  • Identify any devices that failed to update
  • Cloud continues accepting both formats indefinitely for stragglers

Phase 4: Cost Verification

  • Measure actual bandwidth reduction (target: 40-50%)
  • Calculate ROI: savings vs firmware update costs

Critical Success Factors:

  • Never remove JSON support from cloud (some devices may never update)
  • Include version field in firmware to track migration status
  • Have rollback plan if CBOR library causes stability issues

Situation: A factory floor has 200 vibration sensors monitoring machine health. Each sensor samples at 1 kHz and sends FFT results (256 frequency bins, each a 32-bit float) every second. The system requires:

  • Less than 100ms latency for anomaly detection
  • 99.99% reliability (equipment damage/downtime is extremely expensive)
  • Local edge processing before cloud upload

Question: What format would you use for sensor-to-edge communication vs edge-to-cloud communication? Justify your choices.

Question: Which format pairing best matches the recommendation for low-latency local ingest and schema-enforced edge-to-cloud messaging?

Explanation: Custom binary or FlatBuffers for the high-rate local link, which benefits from ultra-fast parsing and zero-copy access, and Protobuf for edge-to-cloud, which benefits from schema enforcement and tooling (e.g., gRPC).

Two-Tier Format Strategy:

Sensor-to-Edge: Custom Binary or FlatBuffers

  • Payload: 256 floats x 4 bytes = 1,024 bytes per message
  • Messages: 200 sensors x 1/sec = 200 messages/sec = 204.8 KB/sec
  • Latency requirement: <100ms means the format must be ultra-fast to parse

Why Custom Binary/FlatBuffers:

  • Zero-copy access - read floats directly from buffer, no parsing
  • Predictable latency - no variable-length decoding surprises
  • Minimal CPU overhead - critical when processing 200 streams simultaneously
  • FlatBuffers advantage - provides schema evolution if needed later
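
To make zero-copy concrete, here is a sketch using NumPy (byte order assumed little-endian; adjust to match what the sensors actually emit):

import numpy as np

def fft_bins(message):
    # View the 1,024-byte message as 256 float32 bins - no copy, no parsing
    return np.frombuffer(message, dtype='<f4', count=256)

msg = np.arange(256, dtype='<f4').tobytes()   # stand-in sensor message
assert fft_bins(msg)[42] == 42.0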

Edge-to-Cloud: Protocol Buffers with Delta Compression

After edge processing, only anomalies and aggregates are sent:

  • Normal operation: 1 aggregate per minute per sensor (summary stats)
  • Anomaly detected: Full FFT spectrum + alerts

Why Protobuf:

  • Schema enforcement - cloud and edge must agree on data format
  • Compression-friendly - repeated similar values compress well
  • Language-agnostic - edge might be C++, cloud is Python/Java
  • gRPC integration - natural fit for edge-to-cloud RPC patterns

Bandwidth comparison:

Path                     Raw Data    With Strategy             Savings
Sensor to Edge           204.8 KB/s  204.8 KB/s                0% (local)
Edge to Cloud (normal)   204.8 KB/s  2 KB/s (aggregates)       99%
Edge to Cloud (anomaly)  204.8 KB/s  10 KB/s (selected FFTs)   95%

Key insight: Format choice depends on communication path. Local high-speed links can afford larger payloads; cloud links need aggressive optimization.


44.4 Worked Example: Optimizing Payload Size for a Battery-Constrained LoRaWAN Sensor

Scenario: You are designing a soil moisture sensor for precision agriculture. The sensor runs on 2x AA batteries and must operate for 2+ years. It connects via LoRaWAN (max payload 51 bytes in SF12 mode) and sends readings every 15 minutes. Each reading includes: soil moisture (0-100%), temperature (-40 to +60C), battery voltage (2.0-3.6V), and timestamp.

Goal: Design the most efficient payload format that fits LoRaWAN constraints while maximizing battery life.

What we do: Determine the precision and range needed for each data field.

Why: Choosing appropriate precision prevents wasting bytes on unnecessary accuracy.

Field analysis:

Field            Range         Required Precision    Example Value
Soil moisture    0-100%        0.5% resolution       42.5%
Temperature      -40 to +60C   0.1C resolution       23.7C
Battery voltage  2.0-3.6V      0.01V resolution      3.24V
Timestamp        Unix epoch    1-second resolution   1702732800

Key insight: We don’t need floating point! All values can be encoded as integers with implied decimal places:

  • 42.5% becomes 425 (divide by 10 on receive)
  • 23.7C becomes 237 (divide by 10 on receive)
  • 3.24V becomes 324 (divide by 100 on receive)

What we do: Calculate payload size for JSON, CBOR, and custom binary formats.

Why: Size directly impacts battery life (transmission energy) and LoRaWAN feasibility.

JSON approach (human-readable):

{"m":42.5,"t":23.7,"v":3.24,"ts":1702732800}
  • Size: 44 bytes
  • Pros: Debuggable, self-describing
  • Cons: Quotes, colons, braces waste 20+ bytes

CBOR approach (binary JSON):

A4                          # Map with 4 items
  61 6D                     # "m" (1 char)
  F9 5150                   # 42.5 as float16 (exact)
  61 74                     # "t" (1 char)
  F9 4DED                   # 23.7 as float16 (nearest value: 23.703125)
  61 76                     # "v" (1 char)
  F9 427B                   # 3.24 as float16 (nearest value: ~3.2402)
  62 7473                   # "ts" (2 chars)
  1A 657DA400               # 1702732800 as uint32
  • Size: 24 bytes
  • Pros: Self-describing, standard format
  • Cons: Field names still consume bytes; float16 rounds 23.7 and 3.24 slightly

Custom binary approach (maximum efficiency):

Byte layout:
[0-1]:   moisture = 425 (uint16, value/10 = 42.5%)
[2-3]:   temperature = 637 (uint16 with offset, (val-400)/10 = 23.7C)
[4-5]:   battery = 324 (uint16, value/100 = 3.24V)
[6-9]:   timestamp (uint32, seconds since epoch)

Hex: 01A9 027D 0144 657DA400
  • Size: 10 bytes (77% smaller than JSON!)
  • Pros: Minimal size, maximum battery life
  • Cons: Not self-describing, requires documentation

What we do: Show exact byte-by-byte calculations for the custom binary format.

Why: Understanding the encoding enables implementation and debugging.

Custom binary encoding detail:

Field        Raw Value    Encoding                 Bytes  Hex
Moisture     42.5%        42.5 x 10 = 425          2      0x01A9
Temperature  23.7C        (23.7 + 40) x 10 = 637   2      0x027D
Battery      3.24V        3.24 x 100 = 324         2      0x0144
Timestamp    1702732800   Direct uint32            4      0x657DA400
Total                                              10

Temperature encoding trick: Adding 40 offset makes -40C = 0, allowing unsigned integer storage:

  • -40C becomes (-40+40) x 10 = 0
  • +60C becomes (60+40) x 10 = 1000
  • Range 0-1000 fits in uint16 (0-65535)

Decoding formula (receiver side):

import struct

def decode_sensor_payload(data):
    moisture = struct.unpack('>H', data[0:2])[0] / 10.0       # 425 -> 42.5%
    temp_raw = struct.unpack('>H', data[2:4])[0]
    temperature = (temp_raw / 10.0) - 40.0                    # 637 -> 23.7C
    battery = struct.unpack('>H', data[4:6])[0] / 100.0      # 324 -> 3.24V
    timestamp = struct.unpack('>I', data[6:10])[0]           # uint32 epoch seconds
    return moisture, temperature, battery, timestamp
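
The matching encoder is symmetric (a sketch reusing the struct import above; the function name is illustrative):

def encode_sensor_payload(moisture, temperature, battery, timestamp):
    return struct.pack(
        '>HHHI',
        round(moisture * 10),            # 42.5%  -> 425
        round((temperature + 40) * 10),  # 23.7C  -> 637
        round(battery * 100),            # 3.24V  -> 324
        timestamp,                       # uint32 epoch seconds
    )

assert encode_sensor_payload(42.5, 23.7, 3.24, 1702732800).hex() == '01a9027d0144657da400'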

What we do: Quantify how payload size affects battery life.

Why: For LoRaWAN devices, transmission energy dominates power budget.

LoRaWAN transmission energy (SF12, 14dBm):

  • Base overhead: ~60 bytes (header, CRC, etc.)
  • Energy per byte: ~0.5 mJ (millijoules)

Daily energy consumption:

Format   Payload    Total TX    Energy/msg   Msgs/day   Energy/day
JSON     44 bytes   104 bytes   52 mJ        96         4,992 mJ
CBOR     24 bytes   84 bytes    42 mJ        96         4,032 mJ
Binary   10 bytes   70 bytes    35 mJ        96         3,360 mJ

Battery life calculation (2x AA = 2,000 mAh at 3V = 21,600 J):

Format   Daily Energy   Battery Life              Improvement
JSON     4,992 mJ       4,327 days (11.9 years)   Baseline
CBOR     4,032 mJ       5,357 days (14.7 years)   +24%
Binary   3,360 mJ       6,429 days (17.6 years)   +49%
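
Both tables follow from a few constants (a sketch; all values are the scenario's estimates):

OVERHEAD_BYTES = 60            # LoRaWAN header, CRC, etc.
MJ_PER_BYTE = 0.5              # SF12, 14 dBm estimate
MSGS_PER_DAY = 96              # one message every 15 minutes
BATTERY_J = 2.0 * 3.0 * 3600   # 2,000 mAh at 3 V = 21,600 J

for fmt, payload in {'JSON': 44, 'CBOR': 24, 'Binary': 10}.items():
    mj_per_day = (payload + OVERHEAD_BYTES) * MJ_PER_BYTE * MSGS_PER_DAY
    days = BATTERY_J / (mj_per_day / 1000)
    print(f'{fmt:6s} {mj_per_day:>5,.0f} mJ/day  {days:>5,.0f} days ({days / 365:.1f} years)')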

Reality check: These calculations assume only TX energy. Real battery life is 2-3 years due to sensor, MCU, and leakage. But relative differences still apply!

Outcome: Custom binary format maximizes battery life within LoRaWAN constraints.

Final payload specification:

+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 | Byte 9 |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|    Moisture     |   Temperature   |    Battery      |           Timestamp             |
|    (uint16)     | (uint16+offset) |    (uint16)     |            (uint32)             |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+

Encoding:
- Moisture: value x 10 (42.5% becomes 425)
- Temperature: (value + 40) x 10 (23.7C becomes 637)
- Battery: value x 100 (3.24V becomes 324)
- Timestamp: Unix epoch seconds (big-endian)

Total: 10 bytes (fits LoRaWAN SF12 with room to spare)

Key design decisions:

  1. No field names - Documentation-based format (77% size reduction)
  2. Fixed-point integers - Avoid float overhead, preserve precision
  3. Temperature offset - Enables unsigned encoding of negative values
  4. Big-endian - Network byte order for consistency
  5. Timestamp included - Handles network delays, out-of-order delivery

Implementation note:

// ESP32/Arduino encoding (big-endian to match the decoder above)
void encode_payload(uint8_t* buf, float moisture, float temp, float battery, uint32_t ts) {
    uint16_t m = (uint16_t)(moisture * 10);      // 42.5% -> 425
    uint16_t t = (uint16_t)((temp + 40) * 10);   // 23.7C -> 637 (offset keeps it unsigned)
    uint16_t v = (uint16_t)(battery * 100);      // 3.24V -> 324
    buf[0] = m >> 8; buf[1] = m & 0xFF;
    buf[2] = t >> 8; buf[3] = t & 0xFF;
    buf[4] = v >> 8; buf[5] = v & 0xFF;
    buf[6] = ts >> 24; buf[7] = (ts >> 16) & 0xFF;
    buf[8] = (ts >> 8) & 0xFF; buf[9] = ts & 0xFF;
}

44.5 Additional Knowledge Check

Knowledge Check: Data Format Selection Quick Check

Concept: Choosing the right data format for IoT applications.

Question: A smart home system sends temperature readings every 5 seconds over Wi-Fi. The developer wants easy debugging during development but acceptable production performance. Which format is best?

Explanation: JSON. Wi-Fi has plenty of bandwidth (50+ Mbps typical), so JSON’s overhead is negligible, and its human-readability enables easy debugging during development. Custom binary and Protobuf add complexity without meaningful benefit when bandwidth isn’t constrained.

Question: What is the primary advantage of CBOR over JSON for IoT applications?

Explanation: Smaller encoded size. CBOR (Concise Binary Object Representation) is essentially “binary JSON” - it keeps the self-describing, schema-less flexibility of JSON but encodes values in binary, reducing size by 30-60%. It also supports additional types (such as binary blobs), but the size reduction is the primary IoT benefit.

Question: A LoRaWAN sensor must transmit 4 sensor values (each 0-1000) in a 12-byte payload limit. What encoding approach fits?

Explanation: Fixed-width custom binary. Values 0-1000 fit in 2-byte unsigned integers (0-65535 range), and 4 values x 2 bytes = 8 bytes, well under the 12-byte limit. JSON would be ~18 bytes minimum, and CBOR/MessagePack would be 12-14 bytes (borderline); custom binary is the only format that fits comfortably with margin for headers.
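
A quick check in Python (the values and their order are arbitrary here):

import struct
payload = struct.pack('>4H', 1000, 0, 512, 73)   # four 0-1000 readings as uint16
assert len(payload) == 8                         # well under the 12-byte limit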

Question: A company deploys sensors that will run for 10 years. The data format may need to evolve (add new fields). Which format handles schema evolution best?

Explanation: Protocol Buffers. Its field numbers enable backward compatibility (old code ignores new fields) and forward compatibility (new code tolerates missing fields). JSON allows adding fields but lacks type safety, and custom binary version bytes work but require manual migration code. Protobuf’s field numbering system was specifically designed for long-term schema evolution.


44.6 Summary

Key Takeaways from Practice:

  1. Context matters more than raw size - JSON is often fine for Wi-Fi deployments
  2. Calculate total cost of ownership - Include development, maintenance, and debugging costs
  3. Plan for evolution - Schema changes are inevitable over device lifetimes
  4. Use two-tier strategies - Different formats for local vs cloud communication
  5. Migration requires careful planning - Never remove support for old formats immediately

Practice Checklist:

  • Can you calculate payload sizes for JSON, CBOR, Protobuf, and custom binary?
  • Can you design a custom binary encoding for given sensor data?
  • Can you justify format choices with quantitative analysis?
  • Can you plan a migration strategy from JSON to binary formats?

44.7 What’s Next

Now that you’ve practiced data format selection and design:

Return to Data Formats Index →