47  Text Encoding for IoT

47.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Work with text encoding: Describe ASCII, Unicode, and UTF-8 for IoT applications
  • Choose efficient formats: Compare JSON, CBOR, and binary encodings for bandwidth-constrained IoT
  • Avoid encoding pitfalls: Handle multi-byte characters and buffer sizing correctly
  • Understand sensor data formats: Distinguish between integer and floating-point representations

Foundation Topics:

  • Data Representation Fundamentals - Overview and index
  • Number Systems and Data Units - Binary, decimal, hexadecimal
  • Bitwise Operations and Endianness - Bit manipulation and byte ordering
  • Data Formats for IoT - JSON, CBOR, and binary formats

Apply These Concepts:

  • Packet Structure and Framing - How data is packaged
  • Sensor to Network Pipeline - End-to-end data flow
  • Protocol Selection Framework - Choosing the right protocol


47.2 Prerequisites

Before reading this chapter, you should understand:


47.3 Text Encoding for IoT

⏱️ ~7 min | ⭐⭐ Intermediate | 📋 P02.C01.U03

Text must be converted to numbers for computers to process. This is critical for:

  • Device names and identifiers
  • Configuration files and JSON messages
  • Serial monitor output and debugging
  • MQTT topics and CoAP URIs

47.3.1 ASCII - The Original Standard

ASCII (American Standard Code for Information Interchange) uses 7 bits to encode 128 characters:

| Range | Characters | Examples |
|-------|------------|----------|
| 0-31 | Control codes | NUL (0), LF (10), CR (13) |
| 32-47 | Punctuation/symbols | Space (32), ! (33), / (47) |
| 48-57 | Digits | ‘0’ (48), ‘5’ (53), ‘9’ (57) |
| 65-90 | Uppercase letters | ‘A’ (65), ‘Z’ (90) |
| 97-122 | Lowercase letters | ‘a’ (97), ‘z’ (122) |

Example: The string "IoT" in ASCII:

  • ‘I’ = 73 = 0x49 = 0b01001001
  • ‘o’ = 111 = 0x6F = 0b01101111
  • ‘T’ = 84 = 0x54 = 0b01010100
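The mapping above can be checked directly. The following is a minimal Python sketch that relies only on the built-in `ord` function; the helper name `ascii_table` is illustrative and not part of any library.

```python
def ascii_table(text):
    """Return (char, decimal, hex, binary) for each ASCII character in text."""
    rows = []
    for ch in text:
        code = ord(ch)  # numeric code assigned to the character
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        rows.append((ch, code, f"0x{code:02X}", f"0b{code:08b}"))
    return rows


for row in ascii_table("IoT"):
    print(*row)
```

Each row prints the character followed by its decimal, hexadecimal, and 8-bit binary form, so the output can be compared line by line with the list above.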

Limitation: ASCII only covers English. No accents (é, ñ), no emojis, no Chinese/Arabic characters.

47.3.2 Unicode and UTF-8

Unicode assigns a unique number (a code point) to every character across all languages (143,000+ characters):

  • Latin: A-Z, a-z
  • Cyrillic: А-Я
  • Chinese: characters such as 中
  • Emojis: 🌡 (thermometer), 📡 (antenna), 🔋 (battery)

UTF-8 (Unicode Transformation Format, 8-bit) is the most common Unicode encoding:

| Characters | Bytes Used | Example |
|------------|------------|---------|
| ASCII (A-Z, 0-9) | 1 byte | ‘A’ = 0x41 |
| Latin extended (é, ñ), Greek, Cyrillic, Hebrew, Arabic | 2 bytes | ‘é’ = 0xC3 0xA9 |
| Most other (Chinese, Japanese, Korean) | 3 bytes | ‘中’ = 0xE4 0xB8 0xAD |
| Emojis | 4 bytes | ‘🌡’ (thermometer) = 0xF0 0x9F 0x8C 0xA1 |
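The byte counts in this table can be reproduced with Python's built-in `str.encode`. This is a small sketch; the helper `utf8_bytes` and the sample labels are illustrative names, and the characters are written as escapes so the example does not depend on how this file itself is encoded.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a hex string."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


samples = {
    "Latin capital A": "A",            # U+0041, plain ASCII
    "e with acute": "\u00e9",          # U+00E9
    "CJK ideograph": "\u4e2d",         # U+4E2D
    "thermometer emoji": "\U0001F321", # U+1F321
}

for name, ch in samples.items():
    size = len(ch.encode("utf-8"))
    print(f"{name}: U+{ord(ch):04X} -> {size} byte(s): {utf8_bytes(ch)}")
```

The code point alone determines the encoded length: values below U+0080 take 1 byte, below U+0800 take 2, below U+10000 take 3, and everything above takes 4.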

Why UTF-8 for IoT?

  • Backward compatible with ASCII: 1 byte = 1 character for English
  • Efficient: No wasted space for common characters
  • Universal: Works with MQTT, JSON, HTTP
  • Example: Device name “Sensor_01” = 9 bytes for 9 characters; “Café_01” = 8 bytes for 7 characters (é adds 1)


47.4 Data Format Efficiency

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TB
    subgraph SENSOR["Sensor Reading: 23.5C"]
        DATA["Temperature: 23.5<br/>Humidity: 65%<br/>Status: OK"]
    end

    subgraph JSON["JSON Format"]
        J1["{'temp':23.5,'hum':65,'ok':true}"]
        J2["<b>32 bytes</b><br/>Human readable<br/>High overhead"]
    end

    subgraph CBOR["CBOR Format"]
        C1["A3 64 74656D70 F9 4DE0 63 68756D 18 41 62 6F6B F5"]
        C2["<b>19 bytes</b><br/>Binary JSON<br/>41% smaller"]
    end

    subgraph BINARY["Raw Binary"]
        B1["00 EB 41 01"]
        B2["<b>4 bytes</b><br/>Most compact<br/>88% smaller"]
    end

    subgraph COMPARE["IoT Impact"]
        COMP1["100 devices x 10 msgs/hr"]
        COMP2["JSON: 768 KB/day"]
        COMP3["CBOR: 456 KB/day"]
        COMP4["Binary: 96 KB/day"]
    end

    DATA --> JSON
    DATA --> CBOR
    DATA --> BINARY

    JSON --> COMPARE
    CBOR --> COMPARE
    BINARY --> COMPARE

    style J1 fill:#E67E22,stroke:#2C3E50,stroke-width:1px,color:#fff
    style J2 fill:#E67E22,stroke:#2C3E50,stroke-width:1px,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,stroke-width:1px,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,stroke-width:1px,color:#fff
    style B1 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style B2 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style COMP4 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff

Figure 47.1: Data Format Efficiency Comparison - The same sensor reading encoded in different formats shows dramatic size differences. JSON (orange) is human-readable but verbose at 32 bytes. CBOR (teal) provides JSON semantics in binary form at 19 bytes when the temperature is stored as a half-precision float. Raw binary (navy) is most compact at 4 bytes but requires schema knowledge. For bandwidth-constrained IoT (LoRaWAN allows as little as 51 bytes of payload at its slowest data rates), format choice directly impacts how many readings fit per message. At 100 devices sending 10 messages per hour (24,000 messages per day), the per-message difference adds up to hundreds of kilobytes per day. The trade-off: smaller formats sacrifice readability and flexibility for efficiency. {fig-alt=“Data format efficiency comparison showing same sensor data (Temperature 23.5, Humidity 65%, Status OK) encoded three ways. JSON Format (orange): 32 bytes, human readable, high overhead. CBOR Format (teal): 19 bytes, binary JSON, 41% smaller than JSON. Raw Binary (navy): 4 bytes, most compact, 88% smaller than JSON. IoT Impact section shows 100 devices at 10 messages per hour results in: JSON 768 KB per day, CBOR 456 KB per day, Binary 96 KB per day. Arrows flow from sensor data to all three format options, then to the comparison section.”}
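To see where the size differences in the figure come from, the reading can be encoded by hand. The sketch below uses only the Python standard library. The function names are illustrative, `to_cbor` is a hand-rolled encoder that handles only this one message shape (a real project would more likely use a CBOR library), and storing the temperature as a 16-bit half-precision float is an assumption; a wider float makes the CBOR payload larger.

```python
import json
import struct

reading = {"temp": 23.5, "hum": 65, "ok": True}


def to_json(r):
    # Compact separators drop the spaces json.dumps inserts by default.
    return json.dumps(r, separators=(",", ":")).encode("utf-8")


def to_cbor(r):
    """Minimal CBOR for this exact message shape (not a general encoder)."""
    def text(s):
        raw = s.encode("utf-8")
        assert len(raw) < 24  # short strings: length fits in the header byte
        return bytes([0x60 | len(raw)]) + raw

    assert 24 <= r["hum"] <= 255  # this sketch only emits the 1-byte uint form
    out = bytes([0xA0 | len(r)])                                   # map header
    out += text("temp") + b"\xF9" + struct.pack(">e", r["temp"])  # half float
    out += text("hum") + bytes([0x18, r["hum"]])                  # uint8
    out += text("ok") + (b"\xF5" if r["ok"] else b"\xF4")         # true/false
    return out


def to_binary(r):
    # Schema agreed in advance: int16 temp x10, uint8 humidity, uint8 status.
    return struct.pack(">hBB", round(r["temp"] * 10), r["hum"], int(r["ok"]))


for name, payload in [("JSON", to_json(reading)),
                      ("CBOR", to_cbor(reading)),
                      ("binary", to_binary(reading))]:
    print(f"{name:6s} {len(payload):2d} bytes  {payload.hex(' ')}")
```

The JSON and CBOR payloads both carry the key names in every message, which is most of their overhead. The raw binary form drops the keys entirely, which is why the receiver must already know the field order and scaling.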

This variant shows which format to choose based on your specific IoT deployment:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TB
    subgraph HOME["Home Automation<br/>(Wi-Fi)"]
        H1["Use: JSON"]
        H2["Why: Plenty bandwidth<br/>Easy debugging<br/>curl friendly"]
    end

    subgraph WEAR["Wearable Fitness<br/>(Bluetooth LE)"]
        W1["Use: CBOR"]
        W2["Why: 20-byte default payload<br/>Still parseable<br/>Good compromise"]
    end

    subgraph FARM["Agricultural Sensor<br/>(LoRaWAN)"]
        F1["Use: Raw Binary"]
        F2["Why: 51-byte limit!<br/>Every byte counts<br/>Schema in firmware"]
    end

    subgraph FACTORY["Industrial SCADA<br/>(Ethernet)"]
        I1["Use: Modbus/Binary"]
        I2["Why: Legacy systems<br/>Real-time critical<br/>Proven reliability"]
    end

    CENTER["Choose format<br/>by bandwidth<br/>+ debug needs"]

    CENTER --> HOME
    CENTER --> WEAR
    CENTER --> FARM
    CENTER --> FACTORY

    style CENTER fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style H1 fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
    style H2 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
    style W1 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style W2 fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
    style F1 fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style F2 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
    style I1 fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
    style I2 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000

Figure 47.2: Format selection by IoT scenario: match complexity to bandwidth constraints and debugging needs

Why this variant helps: The original shows abstract byte counts. This diagram answers “which format should I use for MY project?” by mapping real IoT scenarios (home, wearable, agriculture, industrial) to appropriate format choices. Students can identify their use case and immediately know which format to consider.


47.5 Common Misconceptions

Warning: Common Misconception: “All Sensors Use Floating Point Numbers”

Myth: “Sensors output floating-point temperature readings like 23.456°C”

Reality: Most IoT sensors output integers, and floating point is computed later!

Example: DHT22 Temperature Sensor

  • Sensor output: 16-bit integer = 235 (raw value, in tenths of a degree)
  • Actual temperature: 235 / 10 = 23.5°C (fixed-point scaling)
  • Why? Floating point math is slow and power-hungry on microcontrollers

BME280 Pressure Sensor:

  • Raw output: 20-bit integer read from three 8-bit registers (e.g., adc_P = 415148)
  • Conversion formula: Uses integer arithmetic with calibration constants
  • Final pressure: a value in pascals, around 101325 Pa near sea level (computed from integer compensation formula)

Practical rule:

  • Sensor to MCU: Use integers (fast, compact, and free of binary floating-point representation error)
  • MCU to Cloud: Convert to float/JSON for human readability
  • MCU to MQTT: Use binary formats like CBOR to avoid float overhead

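Energy impact: An 8-bit AVR Arduino has no floating-point hardware, so every float operation runs as a software routine. Integer-only math can save roughly 10-30% power on such a device, depending on how much arithmetic the firmware performs.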
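The integer-first rule can be illustrated with fixed-point arithmetic. The sketch below assumes a hypothetical sensor that reports temperature in tenths of a degree; the function names are illustrative. Values stay as integers through averaging and are turned into decimal text only at the very end, without any float operation.

```python
def format_tenths(raw_tenths):
    """Format a value given in tenths of a degree, using only integers."""
    sign = "-" if raw_tenths < 0 else ""
    whole, frac = divmod(abs(raw_tenths), 10)
    return f"{sign}{whole}.{frac}"


def average_tenths(samples):
    """Average readings to the nearest tenth without floating point."""
    total = sum(samples)
    n = len(samples)
    # Round half away from zero using integer arithmetic only.
    if total >= 0:
        return (2 * total + n) // (2 * n)
    return -((2 * -total + n) // (2 * n))


print(format_tenths(235))   # 23.5
print(format_tenths(-5))    # -0.5
print(format_tenths(average_tenths([234, 235, 237])))
```

The same pattern carries over to C firmware, where integer division and modulo replace `divmod`.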

Warning: Text Encoding Pitfall in IoT

If your IoT device expects ASCII but receives UTF-8 with accented characters, you’ll see:

  • Garbled display: “Température” becomes “TempÃ©rature”
  • Buffer overflows: 10-character buffer but UTF-8 string takes 12 bytes
  • Protocol errors: MQTT topic with emoji fails parsing

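Solution: Always use UTF-8 for text in modern IoT systems, and size buffers in bytes rather than characters. A margin of 1.5x the character count is a reasonable estimate for mostly Latin text, but the safe upper bound is 4 bytes per character (plus a terminator in C).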
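Both failure modes are easy to reproduce. The following sketch uses only built-in string methods; `mojibake` and `truncate_utf8` are illustrative helper names. The first shows what happens when UTF-8 bytes are read with a single-byte encoding (Latin-1 is assumed here), and the second shows one way to cut a string to a byte budget without splitting a multi-byte character.

```python
def mojibake(text):
    """Show UTF-8 bytes as they appear when misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # The slice may end inside a multi-byte sequence; "ignore" drops that
    # incomplete tail instead of raising UnicodeDecodeError.
    return cut.decode("utf-8", errors="ignore")


name = "Temp\u00e9rature"                     # 11 characters
print(len(name), len(name.encode("utf-8")))   # 11 12
print(mojibake(name))
print(truncate_utf8(name, 5))                 # Temp
```

Cutting at 5 bytes keeps only "Temp" because the fifth byte is the first half of the two-byte é. A naive byte slice would have left an invalid sequence at the end of the buffer.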


47.6 Knowledge Check

You’re designing a battery-powered environmental sensor that transmits data via LoRaWAN (51-byte payload limit). The sensor measures temperature (range: -40 to +85°C, resolution 0.1°C) and humidity (range: 0-100%, resolution 1%).

Question: Which data format and encoding strategy would maximize battery life while fitting within the payload limit?

Question: What is the most efficient encoding for this sensor?

Explanation: C is correct. Binary encoding uses only 3 bytes total:

  • Temperature: int16 (2 bytes) storing value ×10 (e.g., 235 for 23.5°C)
  • Humidity: uint8 (1 byte) storing 0-100

JSON would use ~30 bytes, CBOR ~15 bytes, CSV ~7 bytes. With 3 bytes per reading, you can fit 17 readings (51 / 3) in one LoRaWAN message instead of just 1-2 with JSON, dramatically reducing transmission count and battery consumption.
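The 3-byte layout and the 17-readings figure can be checked with Python's `struct` module. This is a sketch under the assumptions stated in the question (51-byte payload, temperature scaled by 10). Big-endian byte order and the names `RECORD`, `pack_readings`, and `unpack_readings` are choices made for this example.

```python
import struct

MAX_PAYLOAD = 51                 # payload limit stated in the question
RECORD = struct.Struct(">hB")    # int16 temperature x10, uint8 humidity


def pack_readings(readings):
    """Pack (temp_celsius, humidity_percent) pairs into one payload."""
    capacity = MAX_PAYLOAD // RECORD.size
    if len(readings) > capacity:
        raise ValueError(f"at most {capacity} readings fit in {MAX_PAYLOAD} bytes")
    return b"".join(RECORD.pack(round(t * 10), h) for t, h in readings)


def unpack_readings(payload):
    """Inverse of pack_readings; temperatures come back in degrees Celsius."""
    return [(t10 / 10, h) for t10, h in RECORD.iter_unpack(payload)]


payload = pack_readings([(23.5, 65), (-40.0, 0), (85.0, 100)])
print(RECORD.size, MAX_PAYLOAD // RECORD.size)   # 3 17
print(payload.hex(" "))
print(unpack_readings(payload))
```

Because int16 is signed, the full -40 to +85°C range (-400 to 850 after scaling) fits without any offset. Batching several readings into one uplink is what turns the smaller encoding into fewer transmissions.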


47.8 Summary

Text encoding is essential for IoT communication:

  • ASCII: 7-bit encoding for 128 characters (English only, 1 byte per character)
  • Unicode: 143,000+ characters across all languages
  • UTF-8: Variable-width Unicode encoding, backward compatible with ASCII
  • Format Selection: JSON (verbose, debuggable), CBOR (compact, parseable), Binary (smallest, schema-required)
  • Sensor Data: Most sensors output integers, not floats - compute decimals later

Key Takeaways:

  • UTF-8 is the standard for modern IoT text - always use it
  • Buffer sizing: count bytes, not characters; 1.5x character count is a rough estimate, and 4 bytes per character is the safe upper bound for UTF-8
  • Data format choice impacts bandwidth, battery life, and debugging ease
  • Match format complexity to your connectivity constraints

47.9 What’s Next

Continue learning about data representation:

Or return to the Data Representation overview for the complete topic map.