Text encoding converts characters to numbers: ASCII uses 1 byte for 128 English characters, while UTF-8 handles all world languages (1-4 bytes per character). For IoT data formats, JSON is human-readable but verbose, CBOR is 30-60% smaller, and raw binary is most compact. Most sensors output integers, not floats – convert to decimal only at display time.
Chapter Scope (Avoiding Duplicate Deep Dives)
This chapter focuses on text encoding and payload representation choices.
Stay here for ASCII/Unicode/UTF-8 behavior and message-size trade-offs.
Classify text encoding schemes: Differentiate ASCII, Unicode, and UTF-8 characteristics for IoT applications
Compare data format efficiency: Evaluate JSON, CBOR, and binary encodings for bandwidth-constrained IoT deployments
Calculate UTF-8 buffer requirements: Determine correct buffer sizes for multi-byte characters to prevent overflow errors
Distinguish sensor data representations: Identify when to use integer versus floating-point formats in IoT data pipelines
Select optimal payload formats: Apply cost-benefit analysis to choose encoding strategies based on deployment scale and constraints
No-One-Left-Behind Encoding Loop
Start from one plain-text payload example.
Encode it in UTF-8 and count bytes explicitly.
Compare JSON vs compact/binary alternatives for the same data.
Validate choice against deployment constraints (battery, bandwidth, debugging needs).
For Beginners: Text Encoding
Text encoding is how computers turn letters, numbers, and symbols into the ones and zeros they actually store. ASCII is the original system that covers English letters using one byte per character. UTF-8 extends this to handle every language in the world, from Chinese to Arabic to emoji. For IoT, understanding text encoding helps you know how much space your data takes up and why sending a temperature reading as text (“25.3”) uses more bytes than sending it as a compact number.
Characters and bytes are different measurements: In UTF-8, a string with 20 visible characters can require more than 20 bytes once accented letters, CJK text, or emoji appear.
ASCII is still the baseline: Protocol punctuation, many identifiers, and most English labels are ASCII, which is why UTF-8 works well as a practical superset for IoT systems.
UTF-8 is the default for modern text transport: MQTT topics, JSON payloads, REST APIs, and cloud tooling generally expect UTF-8, so byte sizing and validation must assume it.
Payload overhead often comes from structure, not sensor values: Field names, topic paths, braces, quotes, and separators can dominate the size of small telemetry messages.
Binary formats trade readability for efficiency: CBOR and raw binary save bytes on the wire, but they require schema discipline and make field-level debugging less obvious than JSON.
Most devices begin with scaled integers: Sensors often produce compact integer values such as 235 for 23.5C, then convert to float or text only after transmission.
The right encoding depends on deployment economics: A few extra bytes are negligible on Wi-Fi debug traffic, but expensive on LoRaWAN, BLE advertisements, or large cellular fleets.
Text must be converted to numbers for computers to process. This is critical for:

- Device names and identifiers
- Configuration files and JSON messages
- Serial monitor output and debugging
- MQTT topics and CoAP URIs
6.3.1 ASCII - The Original Standard
ASCII (American Standard Code for Information Interchange) uses 7 bits to encode 128 characters:
Limitation: ASCII only covers English. No accents (é, ñ, ü), no emojis, no Chinese/Arabic/Hebrew characters.
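To make the byte-for-byte nature of ASCII concrete, here is a minimal Python sketch (the "Hello" example is illustrative):

```python
# Each ASCII character maps to exactly one byte (values 0-127)
message = "Hello"
print([ord(ch) for ch in message])        # [72, 101, 108, 108, 111]
print(len(message.encode("ascii")))       # 5 characters -> 5 bytes

# Characters outside the 128-character set are rejected outright
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)       # not ASCII: ordinal not in range(128)
```

This one-character-one-byte property is exactly what breaks once accented or non-Latin text appears, which motivates the Unicode section that follows.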
6.3.2 Unicode and UTF-8
Unicode assigns a unique number (called a code point) to every character across all languages (143,000+ characters):

- Latin extended: é (U+00E9), ñ (U+00F1), ü (U+00FC)
- Cyrillic: Д (U+0414), Ж (U+0416), Я (U+042F)
- CJK (Chinese/Japanese/Korean): 中 (U+4E2D), 日 (U+65E5), 한 (U+D55C)
- Technical symbols: ° (degree), μ (micro), Ω (ohm)
- Emojis: 🌡 (thermometer, U+1F321), 📡 (antenna, U+1F4E1), 🔋 (battery, U+1F50B)
UTF-8 (Unicode Transformation Format, 8-bit) is the most common Unicode encoding:
| Characters | Bytes Used | Example |
|---|---|---|
| ASCII (A-Z, 0-9) | 1 byte | 'A' = 0x41 |
| Latin extended, Greek, Cyrillic, Hebrew, Arabic (é, ñ) | 2 bytes | 'é' = 0xC3 0xA9 |
| CJK and most other scripts (Chinese, Japanese, Korean) | 3 bytes | '中' = 0xE4 0xB8 0xAD |
| Emojis | 4 bytes | 🌡 = 0xF0 0x9F 0x8C 0xA1 |
Why UTF-8 for IoT?
Backward compatible with ASCII: 1 byte = 1 character for English
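The variable widths in the table above can be checked directly, one character per row, with a short Python sketch:

```python
# Byte width of one character from each row of the UTF-8 table
for ch in ["A", "é", "中", "🌡"]:
    encoded = ch.encode("utf-8")
    print(f"{ch}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# A: 1 byte(s) -> 41
# é: 2 byte(s) -> c3 a9
# 中: 3 byte(s) -> e4 b8 ad
# 🌡: 4 byte(s) -> f0 9f 8c a1
```

Note the first byte of each multi-byte sequence encodes its length, which is how a receiver can walk a UTF-8 string without a separate length table.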
Quick Check: Test your understanding of text encoding schemes:
6.4 Data Format Efficiency
Data Format Efficiency Comparison - The same sensor reading encoded in different formats shows dramatic size differences. JSON (orange) is human-readable but verbose at 32 bytes. CBOR (teal) provides JSON semantics in binary form at 18 bytes. Raw binary (navy) is most compact at 4 bytes but requires schema knowledge. For bandwidth-constrained IoT (LoRaWAN max 51 bytes), format choice directly impacts how many readings fit per message. The trade-off: smaller formats sacrifice readability and flexibility for efficiency.
Figure 6.1
Alternative View: Format Selection by IoT Scenario
This variant shows which format to choose based on your specific IoT deployment:
- Use JSON (Wi-Fi Dashboard): Home automation and lab prototypes where direct log inspection matters more than squeezing every byte. Best fit: readable payloads and fast debugging.
- Use CBOR (Edge Gateway): Constrained CoAP or mesh links where structure still helps, but repeating JSON keys starts to hurt. Best fit: compact structure without losing field names.
- Prefer CBOR (Wearable or BLE): Battery-powered devices with small packets where retransmissions cost energy and airtime. Best fit: keep JSON only when humans inspect payloads daily.
- Use Binary (LPWAN Fleet): LoRaWAN, NB-IoT, or satellite sensors where a few bytes saved can mean more samples per uplink. Best fit: explicit schema, smallest possible payload.
Why this variant helps: The original shows abstract byte counts. This diagram answers “which format should I use for MY project?” by mapping real IoT scenarios (Wi-Fi dashboard, edge gateway, wearable, LPWAN fleet) to appropriate format choices. Students can identify their use case and immediately know which format to consider.
6.5 Common Misconceptions
Common Misconception: “All Sensors Use Floating Point Numbers”
Myth: “Sensors output floating-point temperature readings like 23.456C”
Reality: Most IoT sensors output integers, and floating point is computed later!
Example: DHT22 Temperature Sensor
Sensor output: 16-bit integer = 235 (raw value, in tenths of a degree)
Actual temperature: 235 / 10 = 23.5C (integer arithmetic)
Why? Floating point math is slow and power-hungry on microcontrollers
BME280 Pressure Sensor:
Raw output: 24-bit integer (adc_P = 415148)
Conversion formula: Uses integer arithmetic with calibration constants
Final pressure: 101325 Pa (computed from integer compensation formula)
Practical rule:
Sensor to MCU: Use integers (fast, efficient, no rounding errors)
MCU to Cloud: Convert to float/JSON for human readability
MCU to MQTT: Use binary formats like CBOR to avoid float overhead
Energy impact: Integer-only math can save 10-30% power compared to floating point operations on an 8-bit AVR Arduino.
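The scaled-integer pipeline above can be sketched end to end in a few lines of Python (the x10 scale factor mirrors the DHT22-style convention described in this section):

```python
import struct

# Device side: keep the raw scaled integer (no float math on the MCU)
raw_temp = 235                         # tenths of a degree C, as in the text

# Radio link: 2 bytes instead of the 4-byte text string "23.5"
payload = struct.pack(">h", raw_temp)  # signed 16-bit, big-endian
print(len(payload), "bytes on the wire")   # 2 bytes on the wire

# Cloud side: divide by the scale factor only where humans read the value
temp_c = struct.unpack(">h", payload)[0] / 10
print(f"{temp_c} C")                       # 23.5 C
```

The division happens once, server-side, so the microcontroller never touches floating point at all.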
Text Encoding Pitfall in IoT
If your IoT device expects ASCII but receives UTF-8 with accented characters, you’ll see:
Garbled display: a UTF-8 label such as Gebäude can render as corrupted text (for example “GebÃ¤ude”) when the receiver assumes single-byte ASCII
Buffer overflows: a 10-character field can require 12+ bytes once non-ASCII characters appear
Protocol errors: a gateway that assumes Latin-1 or ASCII can reject valid UTF-8 MQTT topics and JSON strings
Solution: Standardize on UTF-8, size buffers by bytes rather than visible characters, and test at least one multilingual topic or payload before deployment.
Quick Check: Test your understanding of the IoT data encoding pipeline:
Quick Check: Test your understanding of data format selection for IoT:
Scenario: Choosing Data Formats
You’re designing a battery-powered environmental sensor that transmits data via LoRaWAN (51-byte payload limit). The sensor measures temperature (range: -40 to +85C, resolution 0.1C) and humidity (range: 0-100%, resolution 1%).
Question: Which data format and encoding strategy would maximize battery life while fitting within the payload limit?
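One way to sanity-check the scenario’s arithmetic is a sketch that packs both readings into a fixed 3-byte binary record; the field layout and x10 temperature scale below are illustrative choices, not the only valid answer:

```python
import struct

def pack_reading(temp_c, humidity_pct):
    """3-byte record: temperature as int16 in tenths of a degree C
    (covers -40.0..+85.0 at 0.1C resolution), humidity as uint8 percent."""
    return struct.pack(">hB", round(temp_c * 10), round(humidity_pct))

payload = pack_reading(-12.3, 47)
print(len(payload), "bytes per reading")                 # 3 bytes per reading
print(51 // len(payload), "readings per 51-byte frame")  # 17 readings
```

Because 0.1C steps over a 125-degree range need only 1,251 distinct values, an int16 is already generous; the point is that the whole reading fits in 3 bytes, versus tens of bytes as JSON.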
6.6 Code Example: IoT Payload Format Comparison
This Python tool compares the same sensor reading encoded in JSON, CBOR-like binary, and raw binary formats, showing the byte-level representation and size differences that matter for bandwidth-constrained IoT networks:
```python
import struct, json

# Same sensor reading encoded three ways
reading = {"temperature": 23.5, "humidity": 65, "pressure": 1013.2}

# 1. JSON text encoding (human-readable, largest)
json_bytes = json.dumps(reading, separators=(",", ":")).encode("utf-8")

# 2. Raw binary encoding (smallest, requires schema)
binary_bytes = struct.pack(
    ">hBH",
    round(23.5 * 10),           # temp as int16 (x10 for 0.1C precision)
    65,                         # humidity as uint8
    round(1013.2 * 10) - 9000,  # pressure as uint16 (offset from 900 hPa)
)

# Compare sizes
print(f"JSON:       {len(json_bytes):3d} bytes {json_bytes.decode()}")
print(f"Raw Binary: {len(binary_bytes):3d} bytes {binary_bytes.hex(' ')}")
print(f"\nBinary is {len(json_bytes)/len(binary_bytes):.1f}x smaller")
print(f"LoRaWAN 51-byte limit: {51//len(binary_bytes)} binary vs "
      f"{51//len(json_bytes)} JSON readings")

# Output:
# JSON:        52 bytes {"temperature":23.5,"humidity":65,"pressure":1013.2}
# Raw Binary:   5 bytes 00 eb 41 04 6c
# Binary is 10.4x smaller
# LoRaWAN 51-byte limit: 10 binary vs 0 JSON readings
```

Note the use of round() rather than int(): truncation can silently shift a value when floating-point noise makes 1013.2 * 10 land just below 10132.
Key insight: For a three-sensor environmental payload, raw binary encoding is roughly 10x smaller than JSON. On a LoRaWAN link (51-byte payload limit), that means 10 readings per message, while a single JSON reading does not fit at all – directly translating to far fewer radio transmissions and dramatically longer battery life.
| Format | Size | Readability | Schema Required | Best For |
|---|---|---|---|---|
| JSON | ~52 bytes | Human-readable | No (self-describing) | Debugging, Wi-Fi devices |
| CBOR | ~20 bytes | Machine-only | No (self-describing) | Constrained devices, CoAP |
| Raw Binary | 5 bytes | Machine-only | Yes (both ends must agree) | LoRaWAN, NB-IoT, ultra-low-power |
Worked Example: Calculating MQTT Topic String Size with UTF-8
Scenario: You’re designing an MQTT topic structure for a smart building system with international deployments (German, French, Japanese buildings). You want to use human-readable topic names that include building names.
Topics might look like:
buildings/München-Office/floor3/temp
buildings/Paris-Siège/étage2/humidity
buildings/東京本社/3階/co2
Question: How many bytes does each topic require, and how should you size your MQTT client’s topic buffer?
6.6.1 Step 1: Understand UTF-8 Character Sizes
Character Type
Bytes
Examples
ASCII (a-z, 0-9, /, -)
1 byte
buildings, floor3, /
Latin Extended (ü, é, ñ, ö)
2 bytes
München, étage, Siège
Japanese/Chinese/Korean
3 bytes
東京本社 (Tokyo HQ), 3階 (3rd floor)
Emojis (rare in topics)
4 bytes
🏢 (office building emoji)
6.6.2 Step 2: Calculate Topic Sizes
German topic: `buildings/München-Office/floor3/temp`

| Part | Characters | ASCII Bytes | Extended Bytes | Total |
|---|---|---|---|---|
| buildings/ | 10 | 10 | 0 | 10 |
| München | 7 (ü = 2 bytes) | 6 | 2 | 8 |
| -Office/floor3/temp | 19 | 19 | 0 | 19 |
| Total | 36 chars | 35 | 2 | 37 bytes |
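The hand calculation can be verified in one line per topic; a Python sketch using the three example topic strings:

```python
# len() counts characters; len(encode()) counts bytes on the wire
topics = [
    "buildings/München-Office/floor3/temp",
    "buildings/Paris-Siège/étage2/humidity",
    "buildings/東京本社/3階/co2",
]
for topic in topics:
    print(f"{topic}: {len(topic)} chars, {len(topic.encode('utf-8'))} bytes")

# buildings/München-Office/floor3/temp: 36 chars, 37 bytes
# buildings/Paris-Siège/étage2/humidity: 37 chars, 39 bytes
# buildings/東京本社/3階/co2: 21 chars, 31 bytes
```

Notice the Japanese topic is the shortest in characters but the Latin ones are close to it in bytes: the 3-byte CJK characters compress a lot of meaning per character.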
French topic: `buildings/Paris-Siège/étage2/humidity` – 37 characters, 39 bytes (è and é take 2 bytes each).
6.6.3 Step 3: Size the Topic Buffer in Bytes
Naive approach:

```c
#define TOPIC_MAX_LENGTH 50  // "50 characters should be enough"
char topic[TOPIC_MAX_LENGTH];
```
Problem: 50 characters might be 50-200 bytes with UTF-8!
Correct approach:
```c
// Rule of thumb: Allocate 3x character count for safety
#define TOPIC_MAX_CHARS 50
#define TOPIC_MAX_BYTES (TOPIC_MAX_CHARS * 3)  // 150 bytes
char topic[TOPIC_MAX_BYTES + 1];               // +1 for null terminator
```
Why 3x?
Most extended characters (Latin, Cyrillic) are 2 bytes
CJK characters are 3 bytes
Mix of ASCII + extended averages ~1.5-2 bytes per character
3x is safe without excessive waste
6.6.4 Key Takeaways
Always allocate 1.5-3x character count for UTF-8 buffers
UTF-8 is backward compatible - ASCII topics are 1 byte per char
MQTT supports UTF-8 natively - no special encoding needed
Test with international examples - don’t assume ASCII-only
Bandwidth cost is minimal - readability wins for topics
Recommendation: Use descriptive UTF-8 topic names. The slight bandwidth increase is worth the operational clarity.
6.7 Concept Relationships
Understanding how text encoding concepts relate helps you design efficient IoT message formats:
ASCII depends on 7-bit encoding, enables simple English identifiers, and becomes a trap when teams assume it is enough for international device names or user-visible text.
Unicode defines the character set, enables one code point per symbol across languages, and is often misunderstood as inherently large even though UTF-8 keeps common English text compact.
UTF-8 turns Unicode code points into 1-4 bytes, enables backward-compatible web and MQTT text, and is often misread as “always bigger” even though pure ASCII stays 1 byte per character.
UTF-16 is useful in some runtime environments, but it is usually a poor transport choice for IoT text because English-heavy payloads double in size compared with UTF-8.
Buffer sizing depends on UTF-8 byte width, prevents overflows, and breaks when developers assume "cafe" and "café" occupy the same number of bytes.
JSON text encoding depends on UTF-8 plus key-value structure, enables self-describing payloads, and becomes expensive when repeated field names dominate tiny telemetry packets.
MQTT topic encoding depends on UTF-8 topic strings, enables hierarchical routing, and is often underestimated because long descriptive paths add cost to every single publish.
Locale handling depends on both encoding and regional formatting conventions, enabling international deployments while introducing edge cases around logs, units, and user-visible metadata.
Buffer sizing must account for UTF-8’s variable width (allocate 1.5-3× the character count)
JSON mandates UTF-8, so multi-byte characters inflate payload size
MQTT topics use UTF-8, so descriptive names cost bandwidth
Critical Trade-offs:
MQTT topic names: t/001 is compact, but sensors/warehouse-3/temperature is easier to debug. The human-friendly version can cost more than 5x the bytes.
Sensor IDs: a 2-byte binary ID is efficient, while a 36-character UUID is operationally safer in distributed systems. The trade-off is collision risk versus message size.
Timestamp format: Unix epoch integers are compact; ISO 8601 strings are readable. Expect roughly a 5x size difference when you choose readability.
Error reporting: a 1-byte code is efficient, but a full text error is easier to diagnose remotely. This can change failure payload cost by 20-50x.
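The byte costs behind these trade-offs are easy to measure; a small Python sketch (the example topic names and timestamp value are illustrative):

```python
import struct

# Compact vs descriptive MQTT topic, sent with every single publish
compact, descriptive = "t/001", "sensors/warehouse-3/temperature"
print(len(compact), "vs", len(descriptive), "bytes per publish")  # 5 vs 31

# Compact vs readable timestamp
epoch = struct.pack(">I", 1736899200)         # uint32 Unix seconds: 4 bytes
iso = "2025-01-15T00:00:00Z".encode("ascii")  # ISO 8601 string: 20 bytes
print(len(epoch), "vs", len(iso), "bytes per timestamp")          # 4 vs 20
```

Running quick measurements like this before committing to a schema turns the trade-off debate into concrete per-message numbers.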
Gateway transforms to JSON: 80-byte REST API call to cloud
Cloud stores in TimescaleDB: 2.5 bytes per row (columnar compression)
What to Observe: Encoding decisions cascade through your system. Choose UTF-8 everywhere for text, but recognize when binary encoding is justified (sensor→gateway transmission).
6.8 Try It Yourself: Text Encoding Impact on IoT Payloads
Scenario: You’re designing an MQTT-based temperature monitoring system with 100 sensors. Each sensor reports every 60 seconds. Your cellular connectivity costs $0.10/MB. You need to choose between two message formats.
Hybrid approach wins: Use compact JSON ({"id":"...","t":23.7}) for deployments under 500 devices. Switch to binary only at scale.
MQTT topic costs: Using descriptive topics (sensors/temperature/SN003F42 = 28 bytes) vs minimal (t/SN003F42 = 10 bytes) adds 18 bytes per publish – roughly $95/year extra for this 100-sensor deployment (18 B × 1440 messages/day × 365 days × 100 sensors ≈ 946 MB at $0.10/MB). Worth it for operational clarity? Depends on your team.
Recommended Decision Matrix:
| Fleet Size | Format | Rationale |
|---|---|---|
| < 100 sensors | Verbose JSON | Debugging ease matters more than bandwidth cost at this scale. |
| 100-500 sensors | Compact JSON | You keep tooling friendliness while capturing the low-effort 55% savings. |
| 500-5,000 sensors | Binary or CBOR | The deployment is large enough that breakeven can happen within a practical time horizon. |
| > 5,000 sensors | Binary with schema | Bandwidth savings are large enough to justify explicit schema management. |
Putting Numbers to It
For 100 IoT temperature sensors with 60-second reporting over cellular ($0.10/MB):
\[ \text{Cost}_{\text{annual}} = N \times M_{\text{daily}} \times 365 \times \frac{S}{10^{6}} \times C \]

where \(N = 100\) sensors, \(M_{\text{daily}} = 1440\) messages per day, \(S\) is the message size in bytes, and \(C = 0.10\) dollars per MB:
Shortening JSON keys (“temperature_celsius” → “t”) saves $399/year with zero engineering cost. Binary encoding saves $631/year but requires $15K firmware investment (24-year breakeven). At 1,000 sensors, binary saves $6,307/year (2.4-year breakeven)—crossing the economic justification threshold around 300-500 devices.
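These breakeven figures can be reproduced with a short Python sketch; the per-message sizes of ~125 bytes for verbose JSON and ~5 bytes for binary are assumptions chosen to match the savings quoted above:

```python
def annual_cost_usd(n_sensors, msgs_per_day, msg_bytes, usd_per_mb=0.10):
    """Annual cellular payload cost: sensors x messages x size x rate."""
    mb_per_year = n_sensors * msgs_per_day * 365 * msg_bytes / 1e6
    return mb_per_year * usd_per_mb

# Assumed sizes: verbose JSON ~125 B/message, raw binary ~5 B/message
json_cost = annual_cost_usd(100, 1440, 125)
binary_cost = annual_cost_usd(100, 1440, 5)
print(f"Saved: ${json_cost - binary_cost:.0f}/year")   # ~$631/year at 100 sensors

# Breakeven on a $15K firmware investment
print(f"Breakeven: {15_000 / (json_cost - binary_cost):.0f} years")  # ~24 years
```

Rerunning with n_sensors=1000 gives roughly $6,307/year saved, matching the 2.4-year breakeven cited in the text.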
Common Pitfalls
1. Treating Character Count as Byte Count
UTF-8 makes visible length and byte length diverge as soon as accented characters, CJK text, or emoji appear. Size buffers from the maximum byte count, include the null terminator where required, and test with at least one multilingual example before deployment.
2. Shipping Verbose Text on Constrained Links by Default
Readable JSON is useful during bring-up, but repeated field names and topic paths can dominate tiny telemetry messages. Measure the real per-message byte cost before assuming the readability overhead is acceptable on LoRaWAN, BLE advertisements, or cellular fleets.
3. Converting to Floating Point Too Early
Many sensors already provide scaled integers, so converting to float or text on the device wastes CPU cycles, bytes, and sometimes precision. Keep compact integer forms on the device and radio link unless a downstream consumer truly needs a floating-point or human-readable representation.
6.9 Summary
Text encoding is essential for IoT communication:
ASCII: 7-bit encoding for 128 characters, stored in 1 byte (English only)
Unicode: 143,000+ characters across all languages
UTF-8: Variable-width Unicode encoding, backward compatible with ASCII
Connection: Return here for the full index of all data representation chapters.
For Kids: Meet the Sensor Squad!
Sammy the Sensor tries to send a message: “Hello, Cloud!”
Max the Microcontroller explains: “Computers don’t understand letters! We need to turn every letter into a number. That’s called encoding.”
Lila the LED shows the secret code chart (ASCII):
H = 72, e = 101, l = 108, l = 108, o = 111
“Hello” = five numbers, five bytes!
“But what about other languages?” asks Sammy. “My friend in Japan writes in Japanese!”
Max pulls out a bigger chart: “That’s Unicode! It has numbers for EVERY letter in EVERY language – even emojis! The thermometer emoji is number 127,777!”
Bella the Battery warns: “Be careful! English letters take 1 byte each, but Chinese characters take 3 bytes and emojis take 4 bytes. If you name your sensor with an emoji, your messages get BIGGER!”
Lila adds: “And here’s a cool trick – most sensors don’t even send words! Temperature 23.5 degrees? The sensor sends the number 235 (just 2 bytes) and the cloud knows to divide by 10. Way more efficient than spelling out ‘twenty-three point five degrees Celsius’!”
The Squad’s Rule: Letters are secretly numbers! English = small numbers (1 byte), other languages = bigger numbers (2-4 bytes). For sensor data, skip the words and just send numbers!
Quick Check: Test your understanding of UTF-8 buffer sizing:
Knowledge Check: UTF-8 Buffer Sizing
Scenario: You are designing a buffer for MQTT topic names. Topics can include non-English characters (e.g., “Gebäude/Temperatur” for a German building sensor).