A couple of years ago I had to deal with some low-level code that sent a UTF-8 encoded string as packets of bytes. At first I converted each packet to a string and stored the concatenation of the results, but I got a defect report saying that we would sometimes get garbled strings containing a � character. I recognized the Unicode replacement character and quickly figured out the cause: a multi-byte UTF-8 character was split between two packets and thus could not be correctly decoded. The solution was simple: accumulate the data as bytes and only convert to a string once all the data has been received.
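The failure mode is easy to reproduce. Here is a minimal Python sketch (illustrative only, not the original code):

```python
# A multi-byte UTF-8 character split across two "packets" cannot be
# decoded packet-by-packet.
text = "héllo"                  # 'é' encodes to the two bytes 0xC3 0xA9
data = text.encode("utf-8")

# Split mid-character: the boundary falls between 0xC3 and 0xA9.
packet1, packet2 = data[:2], data[2:]

# Decoding each packet separately yields replacement characters.
broken = (packet1.decode("utf-8", errors="replace")
          + packet2.decode("utf-8", errors="replace"))
# broken == "h\ufffd\ufffdllo" -- U+FFFD appears where 'é' should be

# The fix: accumulate the bytes and decode once at the end.
fixed = (packet1 + packet2).decode("utf-8")
# fixed == "héllo"
```

Note that a strict decode (without `errors="replace"`) would simply raise an exception on each truncated packet; the replacement character is what you see when the decoder is lenient.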
This memory surfaced when I performed a code review for a colleague who was facing the 1 MiB message size limit of Chrome’s Native Messaging; his solution was to cut the message into chunks and send them one after the other.
I warned him about the danger of splitting a UTF-8 string at an arbitrary byte offset without checking whether it falls on a character boundary.
As mentioned in Wikipedia’s entry for UTF-8, one of the main advantages of UTF-8 is that it is backwards compatible with ASCII: every ASCII character has the same encoding in UTF-8. Since ASCII uses only 7 bits, its most significant bit is always 0, and in UTF-8 a 0 MSB denotes a single-byte character. The first byte of a multi-byte character begins with as many 1 bits as there are bytes in the character, followed by a 0 (e.g. a three-byte character starts with 1110). All the other bytes of the character (known as continuation bytes) begin with 10.
Here’s a summary table:
| First bit(s) | Condition | It is a | Rule |
|---|---|---|---|
| `0` | `(byte & 0x80) == 0` | Single-byte character | It’s OK to cut before or after it |
| `10` | `(byte & 0xC0) == 0x80` | Continuation byte | Do not cut before it |
| `11` | `(byte & 0xC0) == 0xC0` | First byte of a multi-byte character | It’s OK to cut before it but not after it |
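Taken together, the rules say a cut is safe exactly when the byte just after it is not a continuation byte. They translate into a small chunking routine; here is a Python sketch (the function names are my own, and it assumes the input is valid UTF-8 and a limit of at least 4 bytes, the longest UTF-8 character):

```python
def safe_split_point(data: bytes, limit: int) -> int:
    """Largest index <= limit at which `data` can be cut without
    splitting a multi-byte UTF-8 character."""
    if limit >= len(data):
        return len(data)
    i = limit
    # Never cut before a continuation byte (10xxxxxx): back up
    # until we reach the first byte of a character.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def utf8_chunks(data: bytes, limit: int):
    """Yield chunks of at most `limit` bytes, each valid UTF-8."""
    start = 0
    while start < len(data):
        split = safe_split_point(data[start:], limit)
        if split == 0:
            # Limit smaller than one character (or invalid input);
            # fall back to an unsafe cut to guarantee progress.
            split = min(limit, len(data) - start)
        yield data[start:start + split]
        start += split
```

For example, chunking `"aé漢😀"` (whose characters take 1, 2, 3 and 4 bytes respectively) with a 4-byte limit yields three chunks, each of which decodes cleanly on its own.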