Slicing up a UTF-8 string

30 03 2015

A couple of years ago I had to deal with some low level code that sent a UTF-8 encoded string as packets of bytes. At first I converted to string and stored a concatenation of the result but I got a defect saying that we would sometimes get funny strings that contained a � character. I recognized the Unicode replacement character and quickly figured out that the cause was that a multi-byte UTF-8 character was was split between two packets and thus could not be correctly converted to a string. The solution was simple, just accumulate the data as bytes and only convert to string when all the data has been received.

This memory surfaced when I performed a code review for a colleague who was facing a 1 MiB size limitation when using Chrome’s Native Messaging, his solution was to cut the message into chunks and send them one after the other.

I warned him about the danger of arbitrarily splitting a UTF-8 string without checking if you’re at a character boundary.

As mentioned in Wikipedia’s entry for UTF-8, one of the main advantages with UFT-8 is that it is backwards compatible with ASCII, this means that all ASCII characters have the same meaning in UTF-8. Since ASCII uses 7 bits and have a 0 MSB in UTF-8 a 0 MSB denotes a single byte character. The first byte of all multi-byte characters begin with 1 bits times the number of bytes in the character, followed by a (e.g. a three byte character will start with 1110). All the other bytes in the character (known as continuation bytes) all begin with 10.

Here’s a summary table:

First bit(s) Condition It is a Rule
(byte & 0x80) == 0
Single byte character It’s OK to cut before or after it
(byte & 0xC0) == 0x80
Continuation byte Do not cut before or after it
(byte & 0xC0) == 0xC0
First bye of multi-byte character It’s OK to cut before it but not after it

OCD is the path to the dark side

6 01 2015

A while back I had to wrap a built in JavaScript function, this is pretty simple thanks to the fact that JavaScript is a dynamic prototype based language. Here’s an example of how this can be done (not the actual function or functionality in question):

(function wrapAddEventListener() {
  var orig = HTMLElement.prototype.addEventListener;
  function wrapper(name, handler, capture) {
    console.log("Added a handler for " + name + ' on ' + this);, name, function(ev) { 
      console.log("Got Event " + ev.type); 
    }, capture);

  HTMLElement.prototype.addEventListener = wrapper;	

The problem was that then my OCD kicked in because now if I type document.body.addEventListener in the console I get the function’s body instead of function addEventListener() { [native code] }. For some reason this bothered me (why?) enough in order to add the following line to the function wrapping code

wrapper.toString = function() { 
    return orig.toString() 

Now this is deceitful and worthless since it doesn’t really achieve anything, debugging into the function will show the wrapper code. Still I felt that for aesthetic reasons this is preferable.

I’m not sure if covering your tracks like this is evil (since it’s deceitful) or acceptable since it isn’t hiding any semantic changes. I’ll just hope its the worst of my sins for the upcoming year…

Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!I’m considered an expert (rather than barely competent which is also an overstatement).

From the get go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this, the CW2A classes and CComBSTR’s constructor that accepts a const char*. These methods both use the CP_THREAD_ACP code page when converting strings and you cannot set the thread local to be UTF-8.

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

int main(int argc, char* argv[])
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);
	CW2A cw2a8(wide, CP_UTF8);
	string str = CW2A(wide);
	string str8 = CW2A(wide, CP_UTF8);
	CComBSTR bs(str8.c_str());
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));

Our toy example gave almost the expected results:

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

The things that surprised me were the cells in red, those should have the correct string surely?

Then I remembered about the s8 format specifier which instructs Visual studio to display strings as UTF-8, perhaps the strings are correct but Visual Studio is misleading us! After adding s8 to the watch window things look marginally better. Now only the std::string differs from my expectations.

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

A bit more poking around showed that the reason for this is the std::string’s visualizer uses the s specifier.

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added the red 8s to the file (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>


Now, std::string, at least, defaults to UTF-8 representation in the debugger visualizer


You may be asking yourself why there are two lines each for DisplayString and StringView, this is due to the fact that Visual C++’s string uses the Short String Optimization which avoids dynamic allocations for short strings.

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.


Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy, just grep for it, but finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would allow us to deprecate a method without editing VC’s headers but since we want to deprecate a specific overload this is not an option. I had to use my administrator rights again to edit atlcomcli.h.


Now we get a warning for every use of the deprecated method and decide whether you’ve found a lurking bug.




Big in Lichtenstein

30 09 2014

I’ve been neglecting this blog recently but I still pop in once in a while to see if anyone is interested. Since there’s been nothing new in months it’s not surprising that the views are pretty consistently in the low double digits.

One thing that did surprise me was to discover that in the last month the fifth country in regards to visits to my blog was Lichtenstein! 28 views, I didn’t even know Lichtenstein had that many residents.


Even more surprising is that when searching for Liechtenstein, the country is the third Wikipedia result.

Wedding wishes

6 07 2014

Just a quick post since I’m on my way to a cousin’s wedding. This time I won’t be putting on the greeting card what I consider to be the ultimate wedding wish.

May you both be as happy in your marriage as my wife and I thought we would be.


The last time I used this I was later confronted by a worried looking co-worker asking me if everything was OK at home.

Punishments are a Poor Parenting Practice

6 02 2014

They say that you shouldn’t threaten children with punishments, it’s more empowering for children to have the consequences of their actions explained to them.

Say for example that you put your child’s clothes on the radiator so they’re warm and cozy when he gets up.

Now if the child finds it hard to get up in the morning you can tell him:

The radiator has gone off, you should get dressed quickly before your clothes cool down.

And that’s considered good parenting.

If on the other hand, he stays in bed till well after the clothes have reached room temperature and you say:

If you don’t get a bloody move on and get dressed right now I’ll put your clothes in the fridge!

Well in that case, some people may claim, you’re doing things sub-optimally.

Out of memory, out of luck?

6 01 2014

I was recently implementing a new feature that takes a user-supplied file, parses it and adds some slithy toves to the active manxome[1].

Now as I’m sure everyone knows toves consume a lot of memory and slithy toves are amongst the worst offenders. A typical manxome can be expected to contain a single digit number of toves, a number that in extreme cases may rise to the mid twenties. I soon got a defect complaining that if a file with many toves was imported we would encounter an OutOfMemoryException (OOME). This defect contained a screen recording of how to reproduce the defect, in it you could see a directory listing containing a file called 1000_toves.imp which the tester did not select, the directory also contained a file called 2000_toves.imp which was also not selected, the 3000_toves.imp file that was selected did indeed cause an OOME.

The problem with running out of memory is that almost anything you try to do in order to recover from this exception, does in itself, consume more memory. Even roll-back operations that will free up memory when  they are done usually consume some memory while being run. This is why the best way to deal with such an exception is not to be in that situation to begin with. The .NET framework supplies the tool needed in order to not be in that situation to begin with and it’s called MemoryFailPoint the problem I was facing was that I couldn’t find out in advance how much memory I would be consuming.

The simple solution was to define an arbitrarily limit on the number of toves I allowed a file to contain, this artificial limit went against my instincts as a programmer (Scott Meyers would call it a keyhole problem) and is the solution we ultimately chose. I would like to show another solution I explored since it may be the least worse option for someone in some bizarre situation.

The problem this method attempts to solve is of allowing the program to continue functioning after encountering the OOME while giving the strong exception guarantee (i.e. if an exception occurs the program state is as if the operation wasn’t attempted). As things stood not only was the program state changed (some but not all toves were added to the manxome) but worse, the program became unusable, we would encounter OOME after OOME. The basic idea isn’t new, it’s to put some memory aside for a rainy day. If an exception occurs we can then free this memory so that we have enough space to perform roll-back operations.

public void DoStuff(string param)
        var waste = new byte[1024 * 1024 * 100]; // set aside 100 MB
        var random = new Random();
        waste[random.Next(byte.Length)] = 1;


        if (waste[random.Next(byte.Length)] == 1)
            Trace.WriteLine("Do something the compiler can't optimize away");
    catch (OutOfMemoryException oom)
        GC.Collect(); // Now `waste` is collectable, prompt the GC to collect it
        throw new InsufficientMemoryException("", oom); // Wrap OOM so it can be better handled

A couple of notes about what’s going on here

  • I throw InsufficientMemoryException rather than re-throwing the original OOM exception to signal that the program has enough memory to continue, it’s just this operation that failed.
  • Originally there was none of this nonsense of setting a random byte but the compiler kept optimizing waste away. I think that GC.KeepAlive should also work but I didn’t think of it at the time and I no longer have the environment to check it out.

As I said this code was never put to the test so use it at your own risk and only as a last resort[2].

1. These are not the actual names of the entities.
2. I’m sensing a trend here, code I wrote that doesn’t get used seems to find its way to this blog, perhaps I should rename it to undead code.


Get every new post delivered to your Inbox.