Slicing up a UTF-8 string

30 03 2015

A couple of years ago I had to deal with some low-level code that sent a UTF-8 encoded string as packets of bytes. At first I converted each packet to a string and stored the concatenated result, but I got a defect report saying that we would sometimes get funny strings that contained a � character. I recognized the Unicode replacement character (U+FFFD) and quickly figured out the cause: a multi-byte UTF-8 character was split between two packets and thus could not be correctly converted to a string. The solution was simple: accumulate the data as bytes and only convert to a string once all the data has been received.
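Here is a minimal Python sketch of that bug and its fix. The original code isn't shown in this post, so the function names, the sample string, and the packet split are all made up for illustration.

```python
# Hypothetical sample data: "ï" and "é" each take two bytes in UTF-8.
text = "naïve café"
data = text.encode("utf-8")

# Simulate a packet boundary that falls in the middle of "ï" (bytes C3 AF).
packets = [data[:3], data[3:]]


def decode_per_packet(packets):
    """The buggy approach: decode every packet on its own, then concatenate."""
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return "".join(p.decode("utf-8", errors="replace") for p in packets)


def decode_accumulated(packets):
    """The fix: join the raw bytes first and decode once at the end."""
    return b"".join(packets).decode("utf-8")


print(decode_per_packet(packets))   # contains U+FFFD where "ï" was split
print(decode_accumulated(packets))  # the original text, intact
```

Both halves of the split character get replaced, so the broken version ends up with two � characters where the single "ï" used to be.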

This memory surfaced when I performed a code review for a colleague who was facing a 1 MiB size limitation when using Chrome’s Native Messaging. His solution was to cut the message into chunks and send them one after the other.

I warned him about the danger of arbitrarily splitting a UTF-8 string without checking if you’re at a character boundary.

As mentioned in Wikipedia’s entry for UTF-8, one of the main advantages of UTF-8 is that it is backwards compatible with ASCII: every ASCII character has the same encoding and meaning in UTF-8. Since ASCII uses only 7 bits, its most significant bit (MSB) is always 0, so in UTF-8 a 0 MSB denotes a single-byte character. The first byte of every multi-byte character begins with as many 1 bits as there are bytes in the character, followed by a 0 (e.g. a three-byte character starts with 1110). All the other bytes in the character (known as continuation bytes) begin with 10.
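To see these bit patterns for yourself, here is a small Python sketch. The helper names `utf8_bits` and `sequence_length` are invented for this example; they are not part of any library.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def sequence_length(first_byte):
    """Read the character length off its first byte by counting leading 1 bits."""
    if (first_byte & 0x80) == 0:
        return 1  # 0xxxxxxx: plain ASCII, a single byte
    if (first_byte & 0xC0) == 0x80:
        raise ValueError("continuation byte cannot start a character")
    count = 0
    mask = 0x80
    while first_byte & mask:  # count the 1 bits before the first 0
        count += 1
        mask >>= 1
    return count


for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, utf8_bits(ch), sequence_length(encoded[0]))
```

For "€" this prints the three bytes 11100010, 10000010 and 10101100: a first byte starting with 1110, followed by two continuation bytes starting with 10.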

Here’s a summary table:

First bit(s)   Condition                It is a                              Rule
0              (byte & 0x80) == 0       Single-byte character                It’s OK to cut before or after it
10             (byte & 0xC0) == 0x80    Continuation byte                    Do not cut before or after it
11             (byte & 0xC0) == 0xC0    First byte of multi-byte character   It’s OK to cut before it but not after it
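
These rules can be turned into a chunking routine. What follows is only a sketch in Python, not my colleague's actual code, and `is_continuation`, `safe_cut` and `split_utf8` are names invented here. It relies on one observation from the table: a cut is safe exactly when the byte that would start the next chunk is not a continuation byte, so it is enough to back up from the size limit until that holds.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def safe_cut(data, limit):
    """Largest index <= limit at which data can be cut on a character boundary."""
    if limit >= len(data):
        return len(data)
    cut = limit
    # Never start a chunk on a continuation byte: back up to the lead byte.
    while cut > 0 and is_continuation(data[cut]):
        cut -= 1
    return cut


def split_utf8(data, max_bytes):
    """Split UTF-8 bytes into chunks of at most max_bytes, never inside a character."""
    if max_bytes < 4:
        raise ValueError("max_bytes must fit the longest UTF-8 character (4 bytes)")
    chunks = []
    start = 0
    while start < len(data):
        end = safe_cut(data, start + max_bytes)
        if end <= start:
            # Only reachable with malformed input (a long run of continuation
            # bytes); fall back to a hard cut so the loop always makes progress.
            end = min(start + max_bytes, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


message = "a€😀é".encode("utf-8")
for chunk in split_utf8(message, 4):
    print(chunk, chunk.decode("utf-8"))  # every chunk decodes on its own
```

Chunks may come out shorter than the limit, since the cut only ever moves backwards. For valid UTF-8 it moves back by at most three bytes.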