44  Data Formats Practice and Assessment

44.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Solve format selection scenarios: Apply decision frameworks to real-world IoT cases
  • Calculate payload sizes: Design efficient binary encodings for specific sensor data
  • Debug encoding issues: Identify and fix common serialization problems
  • Design migration strategies: Plan format upgrades without service disruption
  • Justify recommendations: Defend format choices with quantitative analysis

This is part of a series on IoT Data Formats:

  1. IoT Data Formats Overview - Introduction and text formats
  2. Binary Data Formats - CBOR, Protobuf, custom binary
  3. Data Format Selection - Decision guides and real-world examples
  4. Data Formats Practice (this chapter) - Scenarios, quizzes, worked examples

44.2 Prerequisites

Before starting this chapter, you should have completed the earlier chapters in this series:

  • IoT Data Formats Overview - Introduction and text formats
  • Binary Data Formats - CBOR, Protobuf, custom binary
  • Data Format Selection - Decision guides and real-world examples

44.3 Knowledge Check Scenarios

Test your understanding of IoT data formats with these scenario-based questions.

Situation: You’re deploying 10,000 smart electricity meters across a city using NB-IoT cellular connectivity. Each meter sends readings every 15 minutes, including:

  • Current power consumption (float)
  • Voltage (float)
  • Meter ID (8-character string)
  • Timestamp (Unix epoch)

The cellular plan costs $0.02 per MB. You’re evaluating JSON vs CBOR vs Protocol Buffers.

Question: Which format would you choose, and why? Calculate the annual data cost difference between JSON and your recommended format.

Question: Using the size estimates given, what is the approximate annual savings of Protobuf compared to JSON?

Explanation: JSON costs $526/year and Protobuf $140/year, so the annual savings are $526 - $140 = $386.

Recommended: Protocol Buffers (or CBOR)

Calculation:

  • Messages per meter per year: 4/hour x 24 hours x 365 days = 35,040 messages
  • Total messages: 35,040 x 10,000 meters = 350.4 million messages/year

Format size estimates for this payload:

  • JSON: ~75 bytes (field names, quotes, braces)
  • CBOR: ~40 bytes (binary encoding, field names still present)
  • Protobuf: ~20 bytes (field numbers instead of names)

Annual data costs:

  • JSON: 350.4M x 75 bytes = 26.3 GB/year = $526/year
  • CBOR: 350.4M x 40 bytes = 14.0 GB/year = $280/year
  • Protobuf: 350.4M x 20 bytes = 7.0 GB/year = $140/year

Savings with Protobuf vs JSON: $386/year (73% reduction)
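
These numbers can be reproduced with a short script (a sketch; the payload sizes are the estimates above, not measured values):

METERS = 10_000
MSGS_PER_YEAR_PER_METER = 4 * 24 * 365           # one reading every 15 minutes
TOTAL_MSGS = METERS * MSGS_PER_YEAR_PER_METER    # 350.4 million messages/year
COST_PER_MB = 0.02

for fmt, size_bytes in {'JSON': 75, 'CBOR': 40, 'Protobuf': 20}.items():
    mb_per_year = TOTAL_MSGS * size_bytes / 1_000_000
    print(f'{fmt:9s} {mb_per_year:>8,.0f} MB/year  ${mb_per_year * COST_PER_MB:,.0f}/year')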

Why Protobuf over CBOR? At this scale (10,000 devices), the additional upfront schema definition cost is justified by:

  1. Stronger typing prevents data corruption bugs
  2. Efficient code generation for both meters and cloud
  3. Schema evolution handled gracefully
  4. Industry standard for high-volume IoT pipelines

Situation: You’re adding a new soil nutrient sensor to an existing LoRaWAN deployment. The sensor measures:

  • pH level (0-14, one decimal precision)
  • Nitrogen content (mg/kg, integer 0-1000)
  • Phosphorus content (mg/kg, integer 0-1000)
  • Potassium content (mg/kg, integer 0-1000)
  • Soil temperature (C, one decimal, -20 to 60)

The LoRaWAN gateway is at the edge of its range (SF12), limiting payload to 51 bytes maximum. Readings are sent every 4 hours.

Question: Can you use JSON for this sensor? If not, design a custom binary format that fits within the 51-byte limit.

Question: In the proposed binary layout, how many bytes are needed for the five measurements (excluding device ID/timestamp)?

Explanation: 9 bytes: pH (1) + N (2) + P (2) + K (2) + temperature (2).

JSON is risky and not recommended here.

Sample JSON: {"pH":7.2,"N":450,"P":320,"K":280,"temp":18.5} = 46 bytes

This barely fits, but:

  1. No room for error (one more decimal place breaks it)
  2. No room for device ID or timestamp
  3. JSON parsing on constrained MCU is expensive

Custom Binary Format (9 bytes total):

Field        Encoding            Bytes  Range                          Notes
pH           uint8 (value x 10)  1      0-140 -> 0.0-14.0              Divide by 10 to decode
Nitrogen     uint16              2      0-1000                         Direct value
Phosphorus   uint16              2      0-1000                         Direct value
Potassium    uint16              2      0-1000                         Direct value
Temperature  int16 (C x 10)      2      -200 to 600 -> -20.0 to 60.0C  Divide by 10 to decode
Total                            9                                     ~80% smaller than JSON
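
A minimal sketch of this layout in Python (big-endian, no padding; the helper names are illustrative):

import struct

SOIL_FMT = '>B3Hh'   # uint8 pH*10, three uint16 N/P/K, int16 temp*10 = 9 bytes

def encode_soil(ph, n, p, k, temp_c):
    return struct.pack(SOIL_FMT, round(ph * 10), n, p, k, round(temp_c * 10))

def decode_soil(payload):
    ph10, n, p, k, t10 = struct.unpack(SOIL_FMT, payload)
    return ph10 / 10, n, p, k, t10 / 10

payload = encode_soil(7.2, 450, 320, 280, 18.5)
assert len(payload) == 9
print(decode_soil(payload))   # (7.2, 450, 320, 280, 18.5)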

Benefits:

  • 9 bytes vs 46 bytes = ~80% bandwidth savings
  • 42 bytes remaining for device ID, timestamp, battery voltage
  • Lower transmission energy = longer battery life
  • Simpler parsing on constrained MCU

Alternative: CBOR with single-character keys and scaled integers would be roughly 20-25 bytes, which also fits and provides more flexibility than custom binary.

Situation: Your company has 50,000 deployed sensors sending JSON over MQTT. Management wants to reduce cellular data costs by 50%. The sensors have limited flash memory (32KB available for firmware updates) and are deployed in remote locations (firmware updates are expensive to verify).

Question: What format would you migrate to, and what is your migration strategy to avoid bricking devices or losing data during the transition?

Question: Which format is the best migration target if you want a smaller payload than JSON, minimal schema coordination, and a small library footprint?

Explanation: CBOR. It keeps a JSON-like data model but encodes it compactly, and the cloud can decode it without strict schema synchronization during rollout.

Recommended Format: CBOR

Why CBOR over Protobuf for this migration:

  1. Same data model as JSON - minimal code changes required
  2. Self-describing - cloud can decode without schema synchronization
  3. Smaller library footprint - fits in 32KB flash constraint
  4. No breaking changes - can decode both JSON and CBOR during transition

Migration Strategy (Zero-Downtime):

Phase 1: Cloud Preparation (Week 1)

  • Update cloud ingest to accept both JSON and CBOR
  • Add content-type header detection (application/json vs application/cbor)
  • Validate CBOR decoding matches JSON semantics
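
A sketch of the dual-format decode step, assuming the Python cbor2 package on the ingest side:

import json
import cbor2  # third-party CBOR decoder (assumed library choice)

def decode_message(content_type, body):
    if content_type == "application/cbor":
        return cbor2.loads(body)
    if content_type == "application/json":
        return json.loads(body)
    raise ValueError("unsupported content type: " + content_type)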

Phase 2: Gradual Firmware Rollout (Weeks 2-8)

  • Push firmware update to 1% of devices (500 sensors)
  • Monitor for decode errors, data quality issues
  • If successful, expand to 10%, 50%, 100%
  • New firmware sends CBOR but can fall back to JSON on a decode-error response

Phase 3: Transition Monitoring (Weeks 8-12)

  • Track ratio of JSON vs CBOR messages
  • Identify any devices that failed to update
  • Cloud continues accepting both formats indefinitely for stragglers

Phase 4: Cost Verification

  • Measure actual bandwidth reduction (target: 40-50%)
  • Calculate ROI: savings vs firmware update costs

Critical Success Factors:

  • Never remove JSON support from cloud (some devices may never update)
  • Include version field in firmware to track migration status
  • Have rollback plan if CBOR library causes stability issues

Situation: A factory floor has 200 vibration sensors monitoring machine health. Each sensor samples at 1 kHz and sends FFT results (256 frequency bins, each a 32-bit float) every second. The system requires:

  • Less than 100ms latency for anomaly detection
  • 99.99% reliability (equipment damage/downtime is extremely expensive)
  • Local edge processing before cloud upload

Question: What format would you use for sensor-to-edge communication vs edge-to-cloud communication? Justify your choices.

Question: Which format pairing best matches the recommendation for low-latency local ingest and schema-enforced edge-to-cloud messaging?

Explanation: Custom binary or FlatBuffers for the high-rate local link, which benefits from ultra-fast parsing and zero-copy access, and Protobuf for edge-to-cloud, which benefits from schema enforcement and tooling (e.g., gRPC).

Two-Tier Format Strategy:

Sensor-to-Edge: Custom Binary or FlatBuffers

  • Payload: 256 floats x 4 bytes = 1,024 bytes per message
  • Messages: 200 sensors x 1/sec = 200 messages/sec = 204.8 KB/sec
  • Latency requirement: <100ms means the format must be ultra-fast to parse

Why Custom Binary/FlatBuffers:

  • Zero-copy access - read floats directly from buffer, no parsing
  • Predictable latency - no variable-length decoding surprises
  • Minimal CPU overhead - critical when processing 200 streams simultaneously
  • FlatBuffers advantage - provides schema evolution if needed later
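
To make zero-copy concrete, here is a sketch using NumPy (byte order assumed little-endian; adjust to match what the sensors actually emit):

import numpy as np

def fft_bins(message):
    # View the 1,024-byte message as 256 float32 bins - no copy, no parsing
    return np.frombuffer(message, dtype='<f4', count=256)

msg = np.arange(256, dtype='<f4').tobytes()   # stand-in sensor message
assert fft_bins(msg)[42] == 42.0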

Edge-to-Cloud: Protocol Buffers with Delta Compression

After edge processing, only anomalies and aggregates are sent:

  • Normal operation: 1 aggregate per minute per sensor (summary stats)
  • Anomaly detected: Full FFT spectrum + alerts

Why Protobuf:

  • Schema enforcement - cloud and edge must agree on data format
  • Compression-friendly - repeated similar values compress well
  • Language-agnostic - edge might be C++, cloud is Python/Java
  • gRPC integration - natural fit for edge-to-cloud RPC patterns

Bandwidth comparison:

Path                     Raw Data    With Strategy             Savings
Sensor to Edge           204.8 KB/s  204.8 KB/s                0% (local)
Edge to Cloud (normal)   204.8 KB/s  2 KB/s (aggregates)       99%
Edge to Cloud (anomaly)  204.8 KB/s  10 KB/s (selected FFTs)   95%

Key insight: Format choice depends on communication path. Local high-speed links can afford larger payloads; cloud links need aggressive optimization.


44.4 Worked Example: Optimizing Payload Size for a Battery-Constrained LoRaWAN Sensor

Scenario: You are designing a soil moisture sensor for precision agriculture. The sensor runs on 2x AA batteries and must operate for 2+ years. It connects via LoRaWAN (max payload 51 bytes in SF12 mode) and sends readings every 15 minutes. Each reading includes: soil moisture (0-100%), temperature (-40 to +60C), battery voltage (2.0-3.6V), and timestamp.

Goal: Design the most efficient payload format that fits LoRaWAN constraints while maximizing battery life.

What we do: Determine the precision and range needed for each data field.

Why: Choosing appropriate precision prevents wasting bytes on unnecessary accuracy.

Field analysis:

Field            Range         Required Precision    Example Value
Soil moisture    0-100%        0.5% resolution       42.5%
Temperature      -40 to +60C   0.1C resolution       23.7C
Battery voltage  2.0-3.6V      0.01V resolution      3.24V
Timestamp        Unix epoch    1-second resolution   1702732800

Key insight: We don’t need floating point! All values can be encoded as integers with implied decimal places:

  • 42.5% becomes 425 (divide by 10 on receive)
  • 23.7C becomes 237 (divide by 10 on receive)
  • 3.24V becomes 324 (divide by 100 on receive)

What we do: Calculate payload size for JSON, CBOR, and custom binary formats.

Why: Size directly impacts battery life (transmission energy) and LoRaWAN feasibility.

JSON approach (human-readable):

{"m":42.5,"t":23.7,"v":3.24,"ts":1702732800}
  • Size: 44 bytes
  • Pros: Debuggable, self-describing
  • Cons: Quotes, colons, braces waste 20+ bytes

CBOR approach (binary JSON):

A4                          # Map with 4 items
  61 6D                     # "m" (1 char)
  F9 5150                   # 42.5 as float16 (exact)
  61 74                     # "t" (1 char)
  F9 4DED                   # 23.7 as float16 (nearest value: 23.703125)
  61 76                     # "v" (1 char)
  F9 427B                   # 3.24 as float16 (nearest value: ~3.2402)
  62 7473                   # "ts" (2 chars)
  1A 657DA400               # 1702732800 as uint32
  • Size: 24 bytes
  • Pros: Self-describing, standard format
  • Cons: Field names still consume bytes; float16 rounds 23.7 and 3.24 slightly

Custom binary approach (maximum efficiency):

Byte layout:
[0-1]:   moisture = 425 (uint16, value/10 = 42.5%)
[2-3]:   temperature = 637 (uint16 with offset, (val-400)/10 = 23.7C)
[4-5]:   battery = 324 (uint16, value/100 = 3.24V)
[6-9]:   timestamp (uint32, seconds since epoch)

Hex: 01A9 027D 0144 657DA400
  • Size: 10 bytes (77% smaller than JSON!)
  • Pros: Minimal size, maximum battery life
  • Cons: Not self-describing, requires documentation

What we do: Show exact byte-by-byte calculations for the custom binary format.

Why: Understanding the encoding enables implementation and debugging.

Custom binary encoding detail:

Field        Raw Value    Encoding                 Bytes  Hex
Moisture     42.5%        42.5 x 10 = 425          2      0x01A9
Temperature  23.7C        (23.7 + 40) x 10 = 637   2      0x027D
Battery      3.24V        3.24 x 100 = 324         2      0x0144
Timestamp    1702732800   Direct uint32            4      0x657DA400
Total                                              10

Temperature encoding trick: Adding 40 offset makes -40C = 0, allowing unsigned integer storage:

  • -40C becomes (-40+40) x 10 = 0
  • +60C becomes (60+40) x 10 = 1000
  • Range 0-1000 fits in uint16 (0-65535)

Decoding formula (receiver side):

import struct

def decode_sensor_payload(data):
    moisture = struct.unpack('>H', data[0:2])[0] / 10.0       # 425 -> 42.5%
    temp_raw = struct.unpack('>H', data[2:4])[0]
    temperature = (temp_raw / 10.0) - 40.0                    # 637 -> 23.7C
    battery = struct.unpack('>H', data[4:6])[0] / 100.0      # 324 -> 3.24V
    timestamp = struct.unpack('>I', data[6:10])[0]           # uint32 epoch seconds
    return moisture, temperature, battery, timestamp
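
The matching encoder is symmetric (a sketch reusing the struct import above; the function name is illustrative):

def encode_sensor_payload(moisture, temperature, battery, timestamp):
    return struct.pack(
        '>HHHI',
        round(moisture * 10),            # 42.5%  -> 425
        round((temperature + 40) * 10),  # 23.7C  -> 637
        round(battery * 100),            # 3.24V  -> 324
        timestamp,                       # uint32 epoch seconds
    )

assert encode_sensor_payload(42.5, 23.7, 3.24, 1702732800).hex() == '01a9027d0144657da400'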

What we do: Quantify how payload size affects battery life.

Why: For LoRaWAN devices, transmission energy dominates power budget.

LoRaWAN transmission energy (SF12, 14dBm):

  • Base overhead: ~60 bytes (header, CRC, etc.)
  • Energy per byte: ~0.5 mJ (millijoules)

Daily energy consumption:

Format   Payload    Total TX    Energy/msg   Msgs/day   Energy/day
JSON     44 bytes   104 bytes   52 mJ        96         4,992 mJ
CBOR     24 bytes   84 bytes    42 mJ        96         4,032 mJ
Binary   10 bytes   70 bytes    35 mJ        96         3,360 mJ

Battery life calculation (2x AA = 2,000 mAh at 3V = 21,600 J):

Format   Daily Energy   Battery Life              Improvement
JSON     4,992 mJ       4,327 days (11.9 years)   Baseline
CBOR     4,032 mJ       5,357 days (14.7 years)   +24%
Binary   3,360 mJ       6,429 days (17.6 years)   +49%
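
Both tables follow from a few constants (a sketch; all values are the scenario's estimates):

OVERHEAD_BYTES = 60            # LoRaWAN header, CRC, etc.
MJ_PER_BYTE = 0.5              # SF12, 14 dBm estimate
MSGS_PER_DAY = 96              # one message every 15 minutes
BATTERY_J = 2.0 * 3.0 * 3600   # 2,000 mAh at 3 V = 21,600 J

for fmt, payload in {'JSON': 44, 'CBOR': 24, 'Binary': 10}.items():
    mj_per_day = (payload + OVERHEAD_BYTES) * MJ_PER_BYTE * MSGS_PER_DAY
    days = BATTERY_J / (mj_per_day / 1000)
    print(f'{fmt:6s} {mj_per_day:>5,.0f} mJ/day  {days:>5,.0f} days ({days / 365:.1f} years)')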

Reality check: These calculations assume only TX energy. Real battery life is 2-3 years due to sensor, MCU, and leakage. But relative differences still apply!

Outcome: Custom binary format maximizes battery life within LoRaWAN constraints.

Final payload specification:

+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 | Byte 9 |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|    Moisture     |   Temperature   |    Battery      |           Timestamp             |
|    (uint16)     | (uint16+offset) |    (uint16)     |            (uint32)             |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+

Encoding:
- Moisture: value x 10 (42.5% becomes 425)
- Temperature: (value + 40) x 10 (23.7C becomes 637)
- Battery: value x 100 (3.24V becomes 324)
- Timestamp: Unix epoch seconds (big-endian)

Total: 10 bytes (fits LoRaWAN SF12 with room to spare)

Key design decisions:

  1. No field names - Documentation-based format (77% size reduction)
  2. Fixed-point integers - Avoid float overhead, preserve precision
  3. Temperature offset - Enables unsigned encoding of negative values
  4. Big-endian - Network byte order for consistency
  5. Timestamp included - Handles network delays, out-of-order delivery

Implementation note:

// ESP32/Arduino encoding (big-endian to match the decoder above)
void encode_payload(uint8_t* buf, float moisture, float temp, float battery, uint32_t ts) {
    uint16_t m = (uint16_t)(moisture * 10);      // 42.5% -> 425
    uint16_t t = (uint16_t)((temp + 40) * 10);   // 23.7C -> 637 (offset keeps it unsigned)
    uint16_t v = (uint16_t)(battery * 100);      // 3.24V -> 324
    buf[0] = m >> 8; buf[1] = m & 0xFF;
    buf[2] = t >> 8; buf[3] = t & 0xFF;
    buf[4] = v >> 8; buf[5] = v & 0xFF;
    buf[6] = ts >> 24; buf[7] = (ts >> 16) & 0xFF;
    buf[8] = (ts >> 8) & 0xFF; buf[9] = ts & 0xFF;
}

44.5 Additional Knowledge Check

Knowledge Check: Data Format Selection Quick Check

Concept: Choosing the right data format for IoT applications.

Question: A smart home system sends temperature readings every 5 seconds over Wi-Fi. The developer wants easy debugging during development but acceptable production performance. Which format is best?

Explanation: JSON. Wi-Fi has plenty of bandwidth (50+ Mbps typical), so JSON’s overhead is negligible, and its human-readability enables easy debugging during development. Custom binary and Protobuf add complexity without meaningful benefit when bandwidth isn’t constrained.

Question: What is the primary advantage of CBOR over JSON for IoT applications?

Explanation: Smaller encoded size. CBOR (Concise Binary Object Representation) is essentially “binary JSON” - it keeps the self-describing, schema-less flexibility of JSON but encodes values in binary, reducing size by 30-60%. It also supports additional types (such as binary blobs), but the size reduction is the primary IoT benefit.

Question: A LoRaWAN sensor must transmit 4 sensor values (each 0-1000) in a 12-byte payload limit. What encoding approach fits?

Explanation: Fixed-width custom binary. Values 0-1000 fit in 2-byte unsigned integers (0-65535 range), and 4 values x 2 bytes = 8 bytes, well under the 12-byte limit. JSON would be ~18 bytes minimum, and CBOR/MessagePack would be 12-14 bytes (borderline); custom binary is the only format that fits comfortably with margin for headers.
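
A quick check in Python (the values and their order are arbitrary here):

import struct
payload = struct.pack('>4H', 1000, 0, 512, 73)   # four 0-1000 readings as uint16
assert len(payload) == 8                         # well under the 12-byte limit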

Question: A company deploys sensors that will run for 10 years. The data format may need to evolve (add new fields). Which format handles schema evolution best?

Explanation: Protocol Buffers. Its field numbers enable backward compatibility (old code ignores new fields) and forward compatibility (new code tolerates missing fields). JSON allows adding fields but lacks type safety, and custom binary version bytes work but require manual migration code. Protobuf’s field numbering system was specifically designed for long-term schema evolution.


44.6 Summary

Key Takeaways from Practice:

  1. Context matters more than raw size - JSON is often fine for Wi-Fi deployments
  2. Calculate total cost of ownership - Include development, maintenance, and debugging costs
  3. Plan for evolution - Schema changes are inevitable over device lifetimes
  4. Use two-tier strategies - Different formats for local vs cloud communication
  5. Migration requires careful planning - Never remove support for old formats immediately

Practice Checklist:

  • Can you calculate payload sizes for JSON, CBOR, Protobuf, and custom binary?
  • Can you design a custom binary encoding for given sensor data?
  • Can you justify format choices with quantitative analysis?
  • Can you plan a migration strategy from JSON to binary formats?

44.7 What’s Next

Now that you’ve practiced data format selection and design:

Return to Data Formats Index →