Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding, and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), I’m considered an expert (rather than barely competent, which is also an overstatement).

From the get-go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this: the CW2A classes and CComBSTR’s constructor that accepts a const char*. Both use the CP_THREAD_ACP code page when converting strings, and you cannot set the thread locale to UTF-8.
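To make the role of the code page concrete, here is a minimal, portable sketch of what decoding with an explicit UTF-8 code page amounts to. The function name utf8_to_utf16 is hypothetical, and this is not how ATL does it (on Windows the work is normally delegated to MultiByteToWideChar with CP_UTF8). It is an illustration only: malformed input is replaced with U+FFFD, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Sketch only: malformed or truncated sequences become U+FFFD.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < in.size());  // is the whole sequence present?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);        // append six payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            out.push_back(0xFFFD);                   // beyond the Unicode range
        } else if (cp >= 0x10000) {
            cp -= 0x10000;                           // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

Calling utf8_to_utf16 on the bytes "Bugs \xD0\xAF Us" yields the UTF-16 text "Bugs Я Us". Feeding the same bytes through a conversion that assumes a single-byte code page instead produces two unrelated characters where the Я should be.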

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

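#include <atlbase.h>	// CComBSTR
#include <atlconv.h>	// CW2A, CA2W
#include <string>

int main(int argc, char* argv[])
{
	// Я is U+042F; the compiler must read the source file's encoding correctly.
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);			// default: thread ANSI code page
	CW2A cw2a8(wide, CP_UTF8);	// explicit UTF-8
	// Copy out of the temporaries explicitly; they die at the end of the statement.
	std::string str = static_cast<const char*>(CW2A(wide));
	std::string str8 = static_cast<const char*>(CW2A(wide, CP_UTF8));
	CComBSTR bs(str8.c_str());	// UTF-8 bytes decoded with the default code page
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));	// UTF-8 bytes decoded as UTF-8
	return 0;
}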

Our toy example gave almost the expected results:

Type          Default       CP_UTF8
CW2A          Bugs ? Us     Bugs Я Us
std::string   Bugs ? Us     Bugs Я Us
CComBSTR      Bugs Я Us     Bugs Я Us

The things that surprised me were the cells in red; surely those should hold the correct string?

Then I remembered the s8 format specifier, which instructs Visual Studio to display strings as UTF-8. Perhaps the strings are correct but Visual Studio is misleading us! After adding s8 to the watch window, things looked marginally better: now only the std::string differed from my expectations.

Type          Default       CP_UTF8
CW2A          Bugs ? Us     Bugs Я Us
std::string   Bugs ? Us     Bugs Я Us
CComBSTR      Bugs Я Us     Bugs Я Us

A bit more poking around showed the reason: std::string’s visualizer uses the s specifier.
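For intuition about what the s and s8 specifiers disagree on, the sketch below interprets the same two bytes both ways. The function names are hypothetical, and the single-byte view is modeled with Latin-1, which agrees with Windows-1252 for these two byte values; the code page the debugger actually applies depends on the machine.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// The UTF-8 bytes of "Я" (U+042F), as a UTF-8 std::string would hold them.
const std::string kYaUtf8 = "\xD0\xAF";

// View each byte as its own character, the way a single-byte code page does.
// Latin-1 is assumed here because its code points equal the byte values.
std::vector<std::uint32_t> decode_as_single_byte(const std::string& bytes)
{
    std::vector<std::uint32_t> code_points;
    for (unsigned char b : bytes) code_points.push_back(b);
    return code_points;
}

// Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
std::uint32_t decode_two_byte_utf8(const std::string& bytes)
{
    assert(bytes.size() == 2);
    const unsigned char lead = static_cast<unsigned char>(bytes[0]);
    const unsigned char cont = static_cast<unsigned char>(bytes[1]);
    assert((lead & 0xE0) == 0xC0 && (cont & 0xC0) == 0x80);
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}
```

The single-byte view gives two code points, U+00D0 and U+00AF, which render as "Ð¯". The UTF-8 view gives the single code point U+042F, which is Я. The bytes in memory are identical; only the interpretation differs.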

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added an 8 to each s specifier in the file, turning s into s8 (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>

Now std::string, at least, defaults to a UTF-8 representation in the debugger visualizer.

You may be asking yourself why there are two lines each for DisplayString and StringView. This is because Visual C++’s string uses the Short String Optimization, which avoids dynamic allocation for short strings: a short string lives in the inline buffer _Bx._Buf, while a longer one lives on the heap behind _Bx._Ptr.
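The following toy class sketches such a layout. Its member names echo the natvis entries (_Bx._Buf, _Bx._Ptr, _Myres, _BUF_SIZE), but ToyString and everything in it is a simplified illustration, not the Visual C++ implementation.

```cpp
#include <cstddef>
#include <cstring>

// Simplified model of a string with the short string optimization.
class ToyString {
public:
    static constexpr std::size_t BUF_SIZE = 16;

    explicit ToyString(const char* s)
        : size_(std::strlen(s)), res_(BUF_SIZE - 1)
    {
        if (size_ < BUF_SIZE) {
            // Fits in the inline buffer (including the terminator): no allocation.
            std::memcpy(bx_.buf, s, size_ + 1);
        } else {
            // Too long: store the characters on the heap and keep a pointer.
            res_ = size_;
            bx_.ptr = new char[size_ + 1];
            std::memcpy(bx_.ptr, s, size_ + 1);
        }
    }

    ~ToyString() { if (!is_inline()) delete[] bx_.ptr; }

    // Copying is disabled to keep the sketch short.
    ToyString(const ToyString&) = delete;
    ToyString& operator=(const ToyString&) = delete;

    // Same test as the natvis Condition: capacity below the buffer size.
    bool is_inline() const { return res_ < BUF_SIZE; }

    const char* c_str() const { return is_inline() ? bx_.buf : bx_.ptr; }

private:
    union {
        char  buf[BUF_SIZE];  // used while the string is short
        char* ptr;            // used once the string has outgrown buf
    } bx_;
    std::size_t size_;        // current length
    std::size_t res_;         // capacity, excluding the terminator
};
```

Because the characters live in one of two places, anything that wants to display them, a visualizer included, has to check the capacity first and then pick the buffer or the pointer. That is why each natvis element appears twice with complementary conditions.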

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.

Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy: just grep for it. Finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would allow us to deprecate a method without editing VC’s headers, but it works by name and so cannot single out a specific overload. I had to use my administrator rights again to edit atlcomcli.h.

Now we get a warning for every use of the deprecated method and can decide, case by case, whether we’ve found a lurking bug.
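As an illustration of the technique, here is a minimal sketch that deprecates a single overload. It uses a hypothetical Text class rather than CComBSTR, and the standard C++14 [[deprecated]] attribute rather than the Visual C++ spelling, __declspec(deprecated), which serves the same purpose.

```cpp
#include <string>

// Hypothetical stand-in for a class with narrow and wide constructors.
class Text {
public:
    // Only this overload is flagged; every call site gets a compiler warning.
    [[deprecated("narrow input is decoded with the default code page; convert explicitly")]]
    explicit Text(const char*) : source_("narrow") {}

    // The wide overload is left alone and stays warning-free.
    explicit Text(const wchar_t*) : source_("wide") {}

    const std::string& source() const { return source_; }

private:
    std::string source_;  // records which overload ran, for demonstration only
};

// Text ok(L"Bugs");   // compiles cleanly
// Text bad("Bugs");   // still compiles, but triggers a deprecation warning
```

Since the attribute is attached to one declaration, the other overloads are unaffected, which is exactly what a name-based mechanism cannot offer.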
