Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!I’m considered an expert (rather than barely competent which is also an overstatement).

From the get go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this, the CW2A classes and CComBSTR’s constructor that accepts a const char*. These methods both use the CP_THREAD_ACP code page when converting strings and you cannot set the thread local to be UTF-8.

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

int main(int argc, char* argv[])
{
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);
	CW2A cw2a8(wide, CP_UTF8);
	string str = CW2A(wide);
	string str8 = CW2A(wide, CP_UTF8);
	CComBSTR bs(str8.c_str());
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));
}

Our toy example gave almost the expected results:

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

The things that surprised me were the cells in red, those should have the correct string surely?

Then I remembered about the s8 format specifier which instructs Visual studio to display strings as UTF-8, perhaps the strings are correct but Visual Studio is misleading us! After adding s8 to the watch window things look marginally better. Now only the std::string differs from my expectations.

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

A bit more poking around showed that the reason for this is the std::string’s visualizer uses the s specifier.

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added the red 8s to the file (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>

 

Now, std::string, at least, defaults to UTF-8 representation in the debugger visualizer

watch8

You may be asking yourself why there are two lines each for DisplayString and StringView, this is due to the fact that Visual C++’s string uses the Short String Optimization which avoids dynamic allocations for short strings.

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.

hex

Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy, just grep for it, but finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would allow us to deprecate a method without editing VC’s headers but since we want to deprecate a specific overload this is not an option. I had to use my administrator rights again to edit atlcomcli.h.

declspec

Now we get a warning for every use of the deprecated method and decide whether you’ve found a lurking bug.

warning

 

 

Advertisements




Global BSTRs

27 09 2012

Working with COM from C++ is pretty natural, after all COM’s interfaces are binary compatible with C++ virtual tables. However there are some places in which the fact that COM was designed to work with C leaks through and causes some friction.

A notable example of this is BSTR, under the covers all BSTR is is a typedef to wchar_t* however a BSTR is conceptually different from a plain string, a BSTR may contain embed Nul characters and is therefore not Nul terminated, instead the length is stored in the memory that precedes the actual string data. BSTRs also resist the concept of const correctness, you can’t have a const BSTR which can be a bit of a pain when you want to have BSTR constants.

What should one do when two different pieces of code want to share the value of a BSTR? There are several options.

1. #define a string literal in a header

#define XPATH_CHEAP_BOOKS L"//genre/book[price<7]/title"

One can pass a (wide) string literal to a function expecting a BSTR and this will even work …if you don’t use BSTR specific methods or marshal the parameter.

void Foo(BSTR bs)
{
    wcout << L"The length of '" << bs << "' is " << SysStringLen(bs) << endl;
}

Foo(L"Your guess is as good as mine");

Code that ran correctly (by accident) for years and years can suddenly start to fail if the class implementing an interface changes from C++ to .NET, if there’s an interop call between C++ and .NET the marshaller go into play. Since the marshaller knows that the correct way to determine the length of a BSTR is to call SysStringLen it will do so and quite probably get an enormous value, this will cause an out of memory exception when trying to allocate the .NET string.

Obviously the same problems happen with const wchar_t*.

Global CComBSTR

CComBSTR XPATH_CHEAP_BOOKS = L"//genre/book[price<7]/title";

Having a global CComBSTR  (in a header file) has several problems

  1. you have to choose it’s storage class
    •  it can’t be extern (which is the default) because then you’ll get a link-time multiple definition error
    • if it’s static there will be an instance for every translation unit (ie .cpp file), which makes point 2 worse
  2. Each constant will allocate a BSTR when the binary (dll or exe) is loaded, even if it’s never used and even if COM isn’t yet ready to successfully run SysAllocString.
  3. Since CComBSTRs can’t be const someone may change them during the application’s lifetime
  4. When the process terminates the destructors will call SysFreeString, this is just a waste of time, the process is going to return all its memory to the OS anyway, who cares if it’s freed or un-freed memory?

A solution

The first hint at a solution came from  __declspec( selectany), this solves problem #1, an extern global object with __declspec( selectany) will not cause a multiple definition error at link time, instead the linker will arbitrarily select one of the symbols and discard the rest.

The next problem (#2) is that we don’t want to initialize each global too early (or indeed at all in case it isn’t used). In order to do this we can just create a class that has a cast operator to BSTR. This cast operator will use lazy evaluation to create the BSTR the first time it is used and cache the value.

Since the cast operator can be called on different threads some protection must be used, thankfully InterlockedCompareExchangePointer is a perfect fit for our needs.

The rest is pretty straight forward.

  • We can avoid #4 (waste of time in process destruction) by neglecting to write a destructor
  • Although we can’t prevent the clients of our code from changing the BSTR at compile time we can verify that nobody has at runtime (I chose to do so only in debug builds)

We want the class to be light weight therefore it’s best if it is initialized with a string literal so that we can grab the pointer and be sure it will be valid throughout the lifetime of the program and not have to copy it. In order to ensure that we really got a string literal we can make the constructor private and make a macro that creates these objects, in the macro we can verify that the string is a literal by concatenating another (empty) string to it (this will fail at compile time for non literal strings). The macro will use a public member factory function that is named in a way that makes it obvious that it should only be passed a string literal (we are protecting against Murphy, not Machiavelli after all).

Warning: The following code seems to work as expected but has not been rigorously tested, I welcome any corrections.

#ifndef GLOBAL_BSTR_HEADER_PROTECTION_
#define GLOBAL_BSTR_HEADER_PROTECTION_
#pragma once

#include "WTypes.h"// BSTR
#include "crtdbg.h"
#ifdef _DEBUG
#include <cwchar>// for ASSERT in LeakBSTR
#endif

// LeakBSTR is for creating globals with the GLOBAL_BSTR macro therefore there is no need 
// to free the BSTR at destruction (hence no destructor)
class LeakBSTR { 
    BSTR bs;
    // It's safe to hold on to the string pointer in the constructor since
    // the private ctor & macro enforce initialization by string literal
    const wchar_t* str; 

    LeakBSTR(const wchar_t* s)
        :  bs(NULL)
        , str(s)
    { }

public:
    static LeakBSTR constructFromStringLiteral(const wchar_t* s)
    {
        return s;
    }

    operator BSTR() 
    {
        if (bs == NULL) {
            BSTR tmp = SysAllocString(str);
            if (InterlockedCompareExchangePointer((void**)&bs, tmp, NULL) != NULL) 
                SysFreeString(tmp); // Another thread already initialized 'bs'
        }

        // make sure nobody messed with the BSTR since it was created
        _ASSERTE(std::wcscmp(str, bs) == 0); 
        return bs;
    }
};

#define GLOBAL_BSTR(name, value) __declspec(selectany) LeakBSTR \
    name(LeakBSTR::constructFromStringLiteral(L"" value L""))

GLOBAL_BSTR(EmptyBSTR, L"");

#endif




A const BSTR isn’t

11 06 2012

If you’ve programmed any COM you’ve come across the ubiquitous BSTR string type. You may have even been tempted to have a read-only BSTR and used a const BSTR. Unfortunately this is not to be. If we look at WTypes.h we’ll see that BSTR is defined thus:

typedef WCHAR OLECHAR;
typedef OLECHAR* BSTR;

Basically a BSTR is a wchar_t*[1], but what happens when we try using a const BSTR?

The rule of thumb when trying to understand what is const in a C++ deceleration is to push the const as far as you can to the right without passing an asterisk and then read from right to left.

const int i; // int const i => i is a const-int 
const int *p; // int const * p => p is a pointer to a const-int
int * const q; // q is a const-pointer to int
const int * const pq; // int const * const pq => pq is a const-pointer to a const-int

This implies that const BSTR is the same as BSTR const, since the typedef captured the asterisk as part of the type the const can travel over it and produce a “wchar_t* const“, i.e.  the pointer is immutable, not the individual characters.

const BSTR deceptive = SysAllocString(L"Read only!");
deceptive[9] = '?'; // deceptive = "Read only?"
deceptive = NULL; // Compile error, deceptive is const

This is probably the exact opposite of the intended purpose.

The same translates to CComBSTR (and _bstr_t), having a const CComBSTR (and passing parameters by const reference) probably don’t do what you think they should.

The only advantage a const CComBSTR has is that it has static linkage but if that’s what you want you should just say static CComBSTR.


[1] At least at the static type level, at the semantic level it’s another story.