Practice choosing IoT data formats through real-world scenarios: smart meter deployments, LoRaWAN agriculture sensors, format migration strategies, and industrial monitoring. Learn to calculate payload sizes, compare annual costs, and design custom binary encodings that maximize battery life.
11.1 Learning Objectives
By the end of this chapter, you will be able to:
Solve format selection scenarios: Apply decision frameworks to real-world IoT cases involving smart meters, agriculture sensors, and industrial monitoring
Calculate payload sizes and costs: Compute byte-level encodings and annual bandwidth expenses for JSON, CBOR, Protobuf, and custom binary formats
Debug encoding issues: Diagnose and fix common serialization problems such as Base64 overhead, endianness mismatches, and precision loss
Design migration strategies: Plan phased format upgrades from JSON to binary without service disruption or data loss
Evaluate battery-life trade-offs: Quantify how payload size affects transmission energy and device longevity for LoRaWAN deployments
Justify format recommendations: Defend format choices with quantitative cost-benefit analysis across the full sensor-to-cloud pipeline
Key Concepts
JSON: Human-readable key-value format — default IoT API format but 2-5× larger than binary alternatives
CBOR: Concise Binary Object Representation — binary format whose data model is a superset of JSON's, no schema required
Protocol Buffers (Protobuf): Schema-defined binary serialization achieving maximum efficiency with code-generated parsers
MessagePack: Binary-JSON bridge: no schema needed, 30-50% smaller than JSON, drop-in replacement for JSON libraries
Serialization: Converting in-memory data structures to bytes for transmission or storage
Deserialization: Reconstructing data structures from bytes — must match the serialization format exactly
Data Format Trade-off: Human readability (JSON) vs. size efficiency (CBOR/Protobuf) vs. schema flexibility (self-describing vs. compiled)
11.2 For Beginners: Data Formats Practice
This chapter is a hands-on exercise where you practice choosing the best way to package sensor data for different situations. Think of data formats like choosing between sending a text message, an email, or a letter – each has trade-offs in size, speed, and cost. You will work through real scenarios (like a farm with thousands of sensors) and learn to calculate which format saves the most battery and bandwidth.
Test your understanding of IoT data formats with these scenario-based questions.
Scenario 1: Smart Meter Deployment
Situation: You’re deploying 10,000 smart electricity meters across a city using NB-IoT cellular connectivity. Each meter sends readings every 15 minutes, including:
Current power consumption (float)
Voltage (float)
Meter ID (8-character string)
Timestamp (Unix epoch)
The cellular plan costs $0.02 per MB. You’re evaluating JSON vs CBOR vs Protocol Buffers.
Question: Which format would you choose, and why? Calculate the annual data cost difference between JSON and your recommended format.
Answer
Recommended: Protocol Buffers (or CBOR)
Calculation:
Messages per meter per year: 4/hour x 24 hours x 365 days = 35,040 messages
Total messages: 35,040 x 10,000 meters = 350.4 million messages/year
Format size estimates for this payload:
JSON: ~75 bytes (field names, quotes, braces)
CBOR: ~40 bytes (binary encoding, field names still present)
Protobuf: ~20 bytes (field numbers instead of names)
Annual data costs:
JSON: 350.4M x 75 bytes = 26.3 GB/year = $526/year
CBOR: 350.4M x 40 bytes = 14.0 GB/year = $280/year
Protobuf: 350.4M x 20 bytes = 7.0 GB/year = $140/year
Savings with Protobuf vs JSON: $386/year (73% reduction)
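The cost arithmetic above can be reproduced with a short Python sketch (the payload sizes are this scenario's estimates, not measured values):

```python
METERS = 10_000
MSGS_PER_METER_YEAR = 4 * 24 * 365   # one reading every 15 minutes = 35,040
COST_PER_MB = 0.02                   # NB-IoT cellular plan, $/MB

def annual_cost(payload_bytes: int) -> float:
    """Annual fleet cost in dollars for a given per-message payload size."""
    total_bytes = METERS * MSGS_PER_METER_YEAR * payload_bytes
    return total_bytes / 1e6 * COST_PER_MB   # bytes -> MB -> $

for fmt, size in [("JSON", 75), ("CBOR", 40), ("Protobuf", 20)]:
    print(f"{fmt:8s} {size:2d} B/msg  ${annual_cost(size):.0f}/year")
```

Running it reproduces the $526 / $280 / $140 figures above.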
Scenario 2: LoRaWAN Agriculture Sensor
Situation: A LoRaWAN soil-nutrient sensor must report pH, nitrogen, phosphorus, potassium, and soil temperature within a 51-byte payload limit. A JSON encoding of this reading runs about 48 bytes, which technically fits but leaves almost no headroom:
No room for error (one more decimal place breaks it)
No room for device ID or timestamp
JSON parsing on a constrained MCU is expensive
Custom Binary Format (9 bytes total):
| Field | Encoding | Bytes | Range | Notes |
|---|---|---|---|---|
| pH | uint8 (value x 10) | 1 | 0-140 -> 0.0-14.0 | Divide by 10 to decode |
| Nitrogen | uint16 | 2 | 0-1000 | Direct value |
| Phosphorus | uint16 | 2 | 0-1000 | Direct value |
| Potassium | uint16 | 2 | 0-1000 | Direct value |
| Temperature | int16 (C x 10) | 2 | -200 to 600 -> -20.0 to 60.0C | Divide by 10 to decode |
| Total | | 9 bytes | | ~81% smaller than JSON |
Benefits:
9 bytes vs 48 bytes = 81% bandwidth savings
42 bytes remaining for device ID, timestamp, battery voltage
Lower transmission energy = longer battery life
Simpler parsing on constrained MCU
Alternative: CBOR would be ~20 bytes, which also fits and provides more flexibility than custom binary.
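A Python sketch of the 9-byte layout, using big-endian `struct` packing with the field order and scaling from the table:

```python
import struct

# Layout from the table, big-endian: pH(uint8, x10) + N,P,K(uint16 each) + temp(int16, C x10)
FMT = ">B3Hh"   # 1 + 2 + 2 + 2 + 2 = 9 bytes

def encode(ph, n, p, k, temp_c):
    return struct.pack(FMT, round(ph * 10), n, p, k, round(temp_c * 10))

def decode(buf):
    ph10, n, p, k, t10 = struct.unpack(FMT, buf)
    return ph10 / 10, n, p, k, t10 / 10

payload = encode(6.8, 420, 35, 180, -3.5)
print(len(payload), decode(payload))  # 9 (6.8, 420, 35, 180, -3.5)
```

Note the fixed-point scaling: encoding pH × 10 as an integer trades unlimited precision for a one-byte field, which is exactly the trade-off the table describes.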
Scenario 3: Format Migration Strategy
Situation: Your company has 50,000 deployed sensors sending JSON over MQTT. Management wants to reduce cellular data costs by 50%. The sensors have limited flash memory (32KB available for firmware updates) and are deployed in remote locations (firmware updates are expensive to verify).
Question: What format would you migrate to, and what is your migration strategy to avoid bricking devices or losing data during the transition?
Answer
Recommended Format: CBOR
Why CBOR over Protobuf for this migration:
Same data model as JSON - minimal code changes required
Self-describing - cloud can decode without schema synchronization
Smaller library footprint - fits in 32KB flash constraint
No breaking changes - can decode both JSON and CBOR during transition
Migration Strategy (Zero-Downtime):
Phase 1: Cloud Preparation (Week 1)
Update cloud ingest to accept both JSON and CBOR
Add content-type header detection (application/json vs application/cbor)
Validate CBOR decoding matches JSON semantics
Phase 2: Gradual Firmware Rollout (Weeks 2-8)
Push firmware update to 1% of devices (500 sensors)
Monitor for decode errors, data quality issues
If successful, expand to 10%, 50%, 100%
New firmware sends CBOR but can fall back to JSON on a decode-error response
Phase 3: Transition Monitoring (Weeks 8-12)
Track ratio of JSON vs CBOR messages
Identify any devices that failed to update
Cloud continues accepting both formats indefinitely for stragglers
Phase 4: Cost Verification
Measure actual bandwidth reduction (target: 40-50%)
Calculate ROI: savings vs firmware update costs
Critical Success Factors:
Never remove JSON support from cloud (some devices may never update)
Include version field in firmware to track migration status
Have rollback plan if CBOR library causes stability issues
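Phase 1's dual-format ingest can be sketched in Python. The first-byte sniffing heuristic is an assumption of this sketch (a real pipeline would trust the content-type header first), and the CBOR path is left as a stdlib-only placeholder:

```python
import json

def looks_like_json(payload: bytes) -> bool:
    # JSON payloads start with '{' or '[' (after optional whitespace);
    # a CBOR map's first byte is major type 5 (0xA0-0xBF), never ASCII '{'.
    return payload.lstrip()[:1] in (b"{", b"[")

def ingest(payload: bytes) -> dict:
    """Phase-1 dual-format ingest: accept JSON now, CBOR once devices migrate."""
    if looks_like_json(payload):
        return json.loads(payload)
    # A real deployment would decode CBOR here (e.g. with the third-party
    # cbor2 package); left unwired to keep this sketch stdlib-only.
    raise NotImplementedError("CBOR path not implemented in this sketch")

print(ingest(b'{"device":"A1","temp":23.5}'))
```

Because the dispatcher never rejects JSON, stragglers that miss the firmware update keep working indefinitely, which is the critical success factor above.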
Scenario 4: Real-Time Industrial Monitoring
Situation: A factory floor has 200 vibration sensors monitoring machine health. Each sensor samples at 1 kHz and sends FFT results (256 frequency bins, each a 32-bit float) every second. The system requires:
Less than 100ms latency for anomaly detection
99.99% reliability (equipment damage/downtime is extremely expensive)
Local edge processing before cloud upload
Question: What format would you use for sensor-to-edge communication vs edge-to-cloud communication? Justify your choices.
Answer
Two-Tier Format Strategy:
Sensor-to-Edge: Custom Binary or FlatBuffers
Payload: 256 floats x 4 bytes = 1,024 bytes per message
Latency requirement: <100ms means format must be ultra-fast to parse
Why Custom Binary/FlatBuffers:
Zero-copy access - read floats directly from buffer, no parsing
Predictable latency - no variable-length decoding surprises
Minimal CPU overhead - critical when processing 200 streams simultaneously
FlatBuffers advantage - provides schema evolution if needed later
Edge-to-Cloud: Protocol Buffers with Delta Compression
After edge processing, only anomalies and aggregates are sent:
Normal operation: 1 aggregate per minute per sensor (summary stats)
Anomaly detected: Full FFT spectrum + alerts
Why Protobuf:
Schema enforcement - cloud and edge must agree on data format
Compression-friendly - repeated similar values compress well
Language-agnostic - edge might be C++, cloud is Python/Java
gRPC integration - natural fit for edge-to-cloud RPC patterns
Bandwidth comparison:
| Path | Raw Data | With Strategy | Savings |
|---|---|---|---|
| Sensor to Edge | 204.8 KB/s | 204.8 KB/s | 0% (local) |
| Edge to Cloud (normal) | 204.8 KB/s | 2 KB/s (aggregates) | 99% |
| Edge to Cloud (anomaly) | 204.8 KB/s | 10 KB/s (selected FFTs) | 95% |
Key insight: Format choice depends on communication path. Local high-speed links can afford larger payloads; cloud links need aggressive optimization.
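The edge-side reduction can be sketched in Python: a window of raw FFT frames collapses to per-bin summary statistics before upload. The frame layout, the one-minute window, and the tiny 3-bin size are assumptions of this demo (the scenario uses 256 bins):

```python
import struct

BINS = 3  # 3 bins keeps the demo short; the scenario uses 256

def summarize(frames):
    """Collapse a window of FFT frames into per-bin (mean, peak) aggregates."""
    n = len(frames)
    means = [sum(f[i] for f in frames) / n for i in range(BINS)]
    peaks = [max(f[i] for f in frames) for i in range(BINS)]
    return means, peaks

# Normal operation: 60 one-second frames -> one aggregate per minute
frames = [[1.0, 2.0, 3.0] for _ in range(60)]
means, peaks = summarize(frames)
raw_bytes = 60 * BINS * 4                              # 60 frames of 32-bit floats
agg_bytes = len(struct.pack(f">{2 * BINS}f", *means, *peaks))
print(raw_bytes, "->", agg_bytes)                      # 720 -> 24
```

The same 30:1 shape holds at 256 bins, which is where the ~99% edge-to-cloud savings in the table comes from.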
11.5 Worked Example: Payload Design for Battery-Constrained Sensor
11.6 Worked Example: Optimizing Payload Size for LoRaWAN Sensor
Scenario: You are designing a soil moisture sensor for precision agriculture. The sensor runs on 2x AA batteries and must operate for 2+ years. It connects via LoRaWAN (max payload 51 bytes in SF12 mode) and sends readings every 15 minutes. Each reading includes: soil moisture (0-100%), temperature (-40 to +60C), battery voltage (2.0-3.6V), and timestamp.
Goal: Design the most efficient payload format that fits LoRaWAN constraints while maximizing battery life.
What we do: Quantify how payload size affects battery life.
Why: For LoRaWAN devices, transmission energy dominates power budget.
Putting Numbers to It
Transmission energy scales with total packet size for LoRaWAN. The total packet includes payload plus protocol overhead (LoRaWAN header, CRC, etc.). For SF12 at 14dBm, approximate energy is \(E_{\text{tx}} = (n_{\text{total\_bytes}}) \times 0.5\,\text{mJ/byte}\) where total bytes includes ~60 bytes of protocol overhead plus your payload. Worked example: A 44-byte JSON payload becomes 104 total bytes (44 + 60 overhead) requiring \(104 \times 0.5 = 52\,\text{mJ}\) per transmission, while a 10-byte binary payload becomes 70 total bytes requiring \(70 \times 0.5 = 35\,\text{mJ}\), saving 33% energy per message. Over 2-3 years at 96 messages/day, this extends battery life by months.
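Plugging the chapter's numbers into the formula:

```python
OVERHEAD_BYTES = 60   # LoRaWAN protocol overhead (chapter estimate)
MJ_PER_BYTE = 0.5     # approximate SF12 @ 14 dBm cost per transmitted byte

def tx_energy_mj(payload_bytes: int) -> float:
    """Energy per transmission in millijoules for a given payload size."""
    return (payload_bytes + OVERHEAD_BYTES) * MJ_PER_BYTE

saving = 1 - tx_energy_mj(10) / tx_energy_mj(44)
print(tx_energy_mj(44), tx_energy_mj(10), f"{saving:.0%}")  # 52.0 35.0 33%
```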
LoRaWAN transmission energy (SF12, 14dBm):
Base overhead: ~60 bytes (header, CRC, etc.)
Energy per byte: ~0.5 mJ (millijoules)
Daily energy consumption:
| Format | Payload | Total TX | Energy/msg | Msgs/day | Energy/day |
|---|---|---|---|---|---|
| JSON | 44 bytes | 104 bytes | 52 mJ | 96 | 4,992 mJ |
| CBOR | 22 bytes | 82 bytes | 41 mJ | 96 | 3,936 mJ |
| Binary | 10 bytes | 70 bytes | 35 mJ | 96 | 3,360 mJ |
Battery life calculation (2x AA = ~2500 mAh = 27,000 J at 3V nominal):
| Format | Daily Energy | Battery Life | Improvement |
|---|---|---|---|
| JSON | 4,992 mJ | 5,413 days (14.8 years) | Baseline |
| CBOR | 3,936 mJ | 6,861 days (18.8 years) | +27% |
| Binary | 3,360 mJ | 8,036 days (22.0 years) | +49% |
Reality check: These calculations assume only TX energy. Real battery life is 2-3 years due to sensor, MCU, and leakage. But relative differences still apply!
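A sketch of the battery-life arithmetic (the table's day counts differ from this by a handful of days because of intermediate rounding):

```python
BATTERY_J = 27_000   # 2x AA: ~2500 mAh at 3 V nominal
MSGS_PER_DAY = 96    # one reading every 15 minutes

def battery_days(energy_per_msg_mj: float) -> float:
    """Days of operation if transmission were the only energy drain."""
    daily_joules = energy_per_msg_mj * MSGS_PER_DAY / 1000
    return BATTERY_J / daily_joules

for fmt, mj in [("JSON", 52), ("CBOR", 41), ("Binary", 35)]:
    print(f"{fmt:6s} {battery_days(mj):5.0f} days")
```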
Design decisions behind the binary layout:
Temperature offset - Enables unsigned encoding of negative values
Big-endian - Network byte order for consistency
Timestamp included - Handles network delays, out-of-order delivery
Implementation note:
```c
// ESP32/Arduino: pack a 10-byte big-endian payload
void encode_payload(uint8_t* buf, float moisture, float temp,
                    float battery, uint32_t ts) {
    uint16_t m = (uint16_t)(moisture * 10);      // 0-100% -> 0-1000
    uint16_t t = (uint16_t)((temp + 40) * 10);   // +40 offset keeps value unsigned
    uint16_t v = (uint16_t)(battery * 100);      // 2.0-3.6 V -> 200-360
    buf[0] = m >> 8;  buf[1] = m & 0xFF;
    buf[2] = t >> 8;  buf[3] = t & 0xFF;
    buf[4] = v >> 8;  buf[5] = v & 0xFF;
    buf[6] = ts >> 24;          buf[7] = (ts >> 16) & 0xFF;
    buf[8] = (ts >> 8) & 0xFF;  buf[9] = ts & 0xFF;
}
```
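On the receiving side, a gateway or cloud service inverts the same layout. A Python decode sketch, assuming the 10-byte big-endian layout of the firmware encoder above:

```python
import struct

def decode_payload(buf: bytes):
    """Inverse of the firmware encoder: three big-endian uint16s + one uint32."""
    m, t, v, ts = struct.unpack(">3HI", buf)
    return m / 10, t / 10 - 40, v / 100, ts

# Bytes as the firmware would produce them for 23.5%, 21.0 C, 3.30 V
buf = struct.pack(">3HI", 235, 610, 330, 1702732800)
print(decode_payload(buf))  # (23.5, 21.0, 3.3, 1702732800)
```

Both sides must agree on the byte order and scaling factors; documenting them (as the design-decision list does) is what makes a custom binary format maintainable.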
11.7 Additional Knowledge Check
Knowledge Check: Data Format Selection Quick Check
Concept: Choosing the right data format for IoT applications.
11.8 Visual Reference Gallery
The following AI-generated figures provide alternative visual perspectives on data format concepts.
JSON vs XML Comparison (Alternative View)
JSON and XML Format Comparison
JSON and XML are the two most widely used text-based data formats in IoT. This visualization contrasts their structural approaches: XML uses hierarchical tags with explicit opening and closing elements, while JSON uses a more compact key-value notation with curly braces and square brackets. For IoT applications, JSON typically offers 30-50% smaller payloads than equivalent XML, faster parsing on resource-constrained devices, and broader support in modern programming languages and cloud platforms.
Serialization Process (Alternative View)
Data Serialization Process Visualization
Serialization transforms structured data (objects, arrays, sensor readings) into a sequential byte stream for transmission or storage. This process is fundamental to IoT communication: sensor readings in memory must be serialized before network transmission, then deserialized at the receiving end. The choice of serialization format directly impacts transmission size, parsing speed, and interoperability between heterogeneous IoT devices and cloud services.
Data Encoding Formats Overview (Alternative View)
Data Encoding Formats for IoT
IoT systems can choose from multiple data encoding formats, each with distinct trade-offs. Text formats (JSON, XML) prioritize human readability and debugging ease. Binary formats (CBOR, MessagePack, Protocol Buffers) optimize for compact size and parsing speed. Custom binary formats offer maximum efficiency but sacrifice interoperability. This visualization helps engineers select the appropriate format based on their bandwidth constraints, processing power, and ecosystem requirements.
Common Mistake: Forgetting to Account for Base64 Encoding Overhead
The Problem:
Many IoT developers optimize their binary format to 10 bytes, celebrate the efficiency, then discover their cloud ingestion pipeline requires Base64 encoding for JSON/REST APIs – adding 33% overhead.
Real Scenario:
Your LoRaWAN sensor sends a perfectly optimized 10-byte binary payload:
Raw binary: 10 bytes
Base64 encoded: 16 bytes (ceil(10/3) x 4 = 16, including padding)
JSON wrapper: {"data":"AQIDBAUG...=="} = 16 + ~14 (quotes/braces/field name) = ~30 bytes
Impact:
Expected: 10 bytes
Actual transmitted: 10 bytes (good!)
Actual stored/forwarded: ~30 bytes (200% overhead!)
LoRaWAN → Cloud gateway bandwidth: 3x larger than expected
Why This Happens:
Base64 encodes 3 bytes as 4 ASCII characters. Most cloud APIs (AWS IoT, Azure IoT Hub, Google Cloud IoT) require JSON, and JSON can’t natively represent binary data, forcing Base64 encoding.
The Solution:
Account for it upfront: Budget 33% overhead when estimating cloud storage/bandwidth costs
Use binary-native protocols: MQTT with binary payloads, CoAP with CBOR, or gRPC
Compress before Base64: If payload is compressible (sensor logs), gzip first
Avoid double-encoding: Don’t Base64 encode, then hex encode, then Base64 again (yes, this happens!)
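The inflation is easy to demonstrate with the standard library. The `data` field name is an assumption here; real APIs use longer wrappers, which is why the scenario quotes ~30 bytes:

```python
import base64
import json

payload = bytes(range(10))            # the optimized 10-byte binary payload
b64 = base64.b64encode(payload)       # 4 chars per 3 bytes, padded
wrapped = json.dumps({"data": b64.decode()})

print(len(payload), len(b64), len(wrapped))  # 10 16 28
```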
Example Calculation (1,000 sensors, 96 msgs/day, 10-byte payload):
| Path | Raw Binary | Base64+JSON | Annual Data | Cost @ $0.10/MB |
|---|---|---|---|---|
| Sensor → Gateway | 10 bytes | 10 bytes | 350 MB | $35 |
| Gateway → Cloud | 10 bytes | 24 bytes | 841 MB | $84 |
Savings opportunity: Use a binary-native cloud ingestion protocol to save $49/year per 1,000 sensors.
Interactive Calculator: Base64 Overhead Impact
```js
viewof binary_payload = Inputs.range([5, 100], {value: 10, step: 1, label: "Binary payload size (bytes)"})
viewof num_sensors = Inputs.range([100, 10000], {value: 1000, step: 100, label: "Number of sensors"})
viewof msgs_day = Inputs.range([10, 200], {value: 96, step: 10, label: "Messages per day per sensor"})
viewof cloud_cost_mb = Inputs.range([0.05, 0.20], {value: 0.10, step: 0.01, label: "Cloud ingestion cost ($/MB)"})
```
11.9 How It Works: From Sensor Reading to Cloud Ingestion
Understanding the complete data format pipeline helps you make informed encoding decisions. Here’s how a sensor reading travels from measurement to cloud storage.
Step 1: Sensor Measurement The temperature sensor produces an analog voltage (e.g., 2.35V representing 23.5°C). The ADC converts this to a digital value (e.g., 235 at 0.01V resolution).
Step 2: In-Memory Representation The microcontroller stores this as a float variable temperature = 23.5 (4 bytes in RAM) or could optimize to uint16_t temp_raw = 235 (2 bytes, implied decimal).
Step 3: Format Selection Decision The firmware chooses an encoding format based on:
Connectivity: LoRaWAN (51-byte limit) → Custom binary. WiFi (unlimited) → JSON.
Scale: 10 sensors → JSON is fine. 10,000 sensors → CBOR/Protobuf saves $thousands/year.
Lifespan: 6-month deployment → JSON. 10-year deployment → Protobuf for schema evolution.
Step 4: Serialization The firmware serializes the reading:
JSON: {"device":"A1","temp":23.5,"ts":1702732800} → 44 bytes
CBOR: Binary encoding of same structure → 22 bytes
Custom Binary: Device ID (2 bytes) + temp×10 (2 bytes) + timestamp (4 bytes) → 8 bytes
Step 5: Transmission The serialized payload is transmitted via the chosen protocol:
LoRaWAN adds 13-byte header → Total: 21 bytes (for 8-byte payload)
MQTT adds variable header (~5 bytes) + topic name (~15 bytes) → Total: ~28 bytes
CoAP adds 4-byte header → Total: 12 bytes
Step 6: Gateway Processing The network gateway may transcode the format:
LoRaWAN gateway: Forwards raw 8-byte binary unchanged (good!)
Legacy REST API: Wraps binary in Base64+JSON → 24 bytes (bad! 3x inflation)
Modern binary protocol: Keeps binary encoding → 8 bytes (good!)
Step 7: Cloud Ingestion The cloud service deserializes and stores:
Time-series DB (InfluxDB, TimescaleDB): Optimized columnar storage, ~2-3 bytes/reading long-term
Document DB (MongoDB, DynamoDB): JSON storage, ~40-60 bytes/reading
Data lake (S3 Parquet): Highly compressed, ~1 byte/reading in aggregate files
Key Insight: The format you choose in Step 4 affects battery life (Step 5), gateway costs (Step 6), and cloud storage expenses (Step 7). A 10-byte binary payload that gets Base64-wrapped becomes 24 bytes in cloud storage – always trace the full pipeline!
Example End-to-End Bandwidth:
| Stage | JSON (44 bytes) | CBOR (22 bytes) | Binary (8 bytes) |
|---|---|---|---|
| Sensor→Gateway | 57 bytes (44+13 header) | 35 bytes (22+13) | 21 bytes (8+13) |
| Gateway→Cloud (if Base64-wrapped) | 44 bytes (JSON pass-through) | 32 bytes (padded Base64) | 12 bytes (padded Base64) |
| Cloud Storage (TimescaleDB) | ~40 bytes (JSON doc) | 2.5 bytes (decomposed) | 2.5 bytes (decomposed) |
What to Observe: Optimizing transmission (sensor→gateway) saves battery. Optimizing cloud ingestion (gateway→cloud) saves bandwidth costs. Optimizing storage format saves long-term costs. You need to consider ALL three.
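The per-stage sizes can be tallied with a short sketch (padded Base64 is assumed for binary payloads crossing a legacy gateway):

```python
LORAWAN_HDR = 13  # bytes added by the LoRaWAN MAC layer

def stage_sizes(payload_bytes: int, passthrough: bool):
    """Return (sensor->gateway, gateway->cloud) sizes for one reading."""
    over_the_air = payload_bytes + LORAWAN_HDR
    # Binary payloads get Base64-wrapped at a legacy gateway: 4 chars per 3 bytes, padded
    to_cloud = payload_bytes if passthrough else -(-payload_bytes // 3) * 4
    return over_the_air, to_cloud

for name, size, is_text in [("JSON", 44, True), ("CBOR", 22, False), ("Binary", 8, False)]:
    ota, cloud = stage_sizes(size, is_text)
    print(f"{name:6s} over-the-air {ota:2d} B, gateway->cloud {cloud:2d} B")
```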
11.10 Concept Relationships
Understanding how data format concepts interrelate helps you navigate trade-offs:
| Primary Concept | Depends On | Enables | Common Confusion |
|---|---|---|---|
| JSON | Text encoding (UTF-8), key-value data model | Human-readable debugging, web APIs | “JSON is always slower” – false for small payloads with fast parsers |
| CBOR | Binary encoding, JSON data model | 30-60% size reduction, type safety (vs JSON) | “CBOR is the same as JSON” – no, it’s binary encoding WITH additional types |
| Protocol Buffers | Schema definition (proto files), code generation | Type safety, schema evolution, tooling | “Protobuf is too complex for IoT” – upfront cost pays off at scale (10K+ devices) |
| Custom Binary | Manual byte layout, documented endianness | Maximum size efficiency | “Add version byte to custom binary” – requires migration code for each version |
| Endianness | Byte order (big-endian/little-endian) | Multi-byte integer serialization | “Endianness only matters for multi-platform” – it matters for ANY binary format |
| Payload Size | Encoding format + data precision + metadata | Transmission energy, airtime costs | “Smaller is always better” – ignores parsing CPU cost on battery-powered MCU |
How These Concepts Work Together:
Choose format family (text vs binary) based on bandwidth constraints
Within binary formats, choose self-describing (CBOR) vs schema-based (Protobuf) vs custom based on scale and lifespan
Account for Base64 encoding if crossing text-based APIs (adds 33% overhead)
Plan schema evolution strategy upfront for deployments lasting >2 years
Document endianness and precision decisions for any custom binary format
Calculate total pipeline cost (sensor→gateway→cloud→storage), not just over-the-air size
Critical Decision Points:
| If Your Deployment Has… | Then Prioritize… | Because… |
|---|---|---|
| Tight bandwidth limits (LoRaWAN, NB-IoT) | Custom binary or CBOR | Payload size directly affects battery life |
| 10,000+ devices | Protocol Buffers or CBOR | Schema validation prevents data corruption at scale |
| 10-year lifespan | Protocol Buffers with versioning | Schema evolution is inevitable |
| WiFi/Ethernet connectivity | JSON or CBOR | Bandwidth is cheap, debugging ease matters more |
| Mixed platforms (Arduino → Python → Java) | Protobuf or CBOR | Language-agnostic formats prevent vendor lock-in |
| Frequent schema changes | JSON or CBOR | Self-describing formats tolerate ad-hoc fields |
11.11 Format Matching Challenge
Match each IoT data format to its defining characteristic:
11.12 Format Selection Decision Process
Place the steps for selecting an IoT data format in the correct order:
Common Pitfalls
1. Encoding Sensor Readings as JSON Strings Instead of Numbers
Sending {"temperature": "23.5"} (a string) instead of {"temperature": 23.5} (a number) forces consumers to parse strings, breaks numeric queries, and increases message size. Validate that all numeric fields are encoded as JSON numbers — this is the most common IoT data format error.
2. Ignoring Schema Evolution When Updating Firmware
Adding a new field to a JSON/CBOR message immediately after a firmware update means old consumers (not yet updated) receive unexpected fields. Use additive-only schema changes (never remove or rename fields), version your schemas, and handle unknown fields gracefully in all consumers.
3. Using Floating Point for All Numeric Fields
IEEE 754 float64 uses 8 bytes per value — a 10-field sensor reading becomes 80 bytes just for numbers. Integer-scaled fixed point (temperature × 100 as int16) reduces the same reading to 2 bytes per value with identical precision for typical IoT ranges.
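The size difference is easy to verify with Python's struct module (the 10-field reading is a made-up example):

```python
import struct

reading = [23.5] * 10  # a hypothetical 10-field sensor reading

as_float64 = struct.pack(">10d", *reading)                        # 8 bytes per field
as_int16 = struct.pack(">10h", *(int(v * 100) for v in reading))  # value x 100, 2 bytes each

print(len(as_float64), len(as_int16))  # 80 20
```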
11.13 Summary
Key Takeaways from Practice:
Context matters more than raw size - JSON is often fine for Wi-Fi deployments
Calculate total cost of ownership - Include development, maintenance, and debugging costs
Plan for evolution - Schema changes are inevitable over device lifetimes
Use two-tier strategies - Different formats for local vs cloud communication
Migration requires careful planning - Never remove support for old formats immediately
Practice Checklist:
Can you calculate payload sizes for JSON, CBOR, Protobuf, and custom binary?
Can you design a custom binary encoding for given sensor data?
Can you justify format choices with quantitative analysis?
Can you plan a migration strategy from JSON to binary formats?
For Kids: Meet the Sensor Squad!
Sammy the Sensor has a problem. “I need to send my temperature reading to the cloud, but I only have a tiny envelope to put it in!”
Max the Microcontroller grins. “That’s like choosing how to write a letter! You could write a long, fancy letter with beautiful handwriting (that’s JSON), or you could write a short coded message (that’s binary)!”
Lila the LED holds up examples:
“Dear Cloud, the temperature is 23.5 degrees” = 43 bytes (like JSON – easy to read but long!)
“T:23.5” = 6 bytes (like a text message – shorter!)
Just the number “235” = 2 bytes (like a secret code – super tiny!)
Bella the Battery does a happy dance. “I LOVE the short version! The less Sammy has to transmit, the longer I last! With the tiny code, I can live for YEARS!”
“But wait,” says Sammy, “won’t the cloud be confused by just a number?”
“That’s why we agree on the code beforehand,” Max explains. “We tell the cloud: ‘the first two bytes are temperature times 10.’ It’s like having a secret decoder ring!”
The Squad’s Rule: Pick the format that fits your envelope! Big envelope (Wi-Fi)? Write fancy letters (JSON). Tiny envelope (LoRaWAN)? Use your secret code (binary)!
11.14 What’s Next
Now that you’ve practiced data format selection and design, explore these related topics: