Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!I’m considered an expert (rather than barely competent which is also an overstatement).

From the get go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this, the CW2A classes and CComBSTR’s constructor that accepts a const char*. These methods both use the CP_THREAD_ACP code page when converting strings and you cannot set the thread local to be UTF-8.

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

int main(int argc, char* argv[])
{
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);
	CW2A cw2a8(wide, CP_UTF8);
	string str = CW2A(wide);
	string str8 = CW2A(wide, CP_UTF8);
	CComBSTR bs(str8.c_str());
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));
}

Our toy example gave almost the expected results:

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

The things that surprised me were the cells in red, those should have the correct string surely?

Then I remembered about the s8 format specifier which instructs Visual studio to display strings as UTF-8, perhaps the strings are correct but Visual Studio is misleading us! After adding s8 to the watch window things look marginally better. Now only the std::string differs from my expectations.

Type Default CP_UTF8
CW2A Bugs ? Us Bugs Я Us
std::string Bugs ? Us Bugs Я Us
CComBSTR Bugs Я Us Bugs Я Us

A bit more poking around showed that the reason for this is the std::string’s visualizer uses the s specifier.

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added the red 8s to the file (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>

 

Now, std::string, at least, defaults to UTF-8 representation in the debugger visualizer

watch8

You may be asking yourself why there are two lines each for DisplayString and StringView, this is due to the fact that Visual C++’s string uses the Short String Optimization which avoids dynamic allocations for short strings.

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.

hex

Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy, just grep for it, but finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would allow us to deprecate a method without editing VC’s headers but since we want to deprecate a specific overload this is not an option. I had to use my administrator rights again to edit atlcomcli.h.

declspec

Now we get a warning for every use of the deprecated method and decide whether you’ve found a lurking bug.

warning

 

 





Big in Lichtenstein

30 09 2014

I’ve been neglecting this blog recently but I still pop in once in a while to see if anyone is interested. Since there’s been nothing new in months it’s not surprising that the views are pretty consistently in the low double digits.

One thing that did surprise me was to discover that in the last month the fifth country in regards to visits to my blog was Lichtenstein! 28 views, I didn’t even know Lichtenstein had that many residents.

Liechtenstein

Even more surprising is that when searching for Liechtenstein, the country is the third Wikipedia result.





Wedding wishes

6 07 2014

Just a quick post since I’m on my way to a cousin’s wedding. This time I won’t be putting on the greeting card what I consider to be the ultimate wedding wish.

May you both be as happy in your marriage as my wife and I thought we would be.

 

The last time I used this I was later confronted by a worried looking co-worker asking me if everything was OK at home.





Punishments are a Poor Parenting Practice

6 02 2014

They say that you shouldn’t threaten children with punishments, it’s more empowering for children to have the consequences of their actions explained to them.

Say for example that you put your child’s clothes on the radiator so they’re warm and cozy when he gets up.

Now if the child finds it hard to get up in the morning you can tell him:

The radiator has gone off, you should get dressed quickly before your clothes cool down.

And that’s considered good parenting.

If on the other hand, he stays in bed till well after the clothes have reached room temperature and you say:

If you don’t get a bloody move on and get dressed right now I’ll put your clothes in the fridge!

Well in that case, some people may claim, you’re doing things sub-optimally.





Out of memory, out of luck?

6 01 2014

I was recently implementing a new feature that takes a user-supplied file, parses it and adds some slithy toves to the active manxome[1].

Now as I’m sure everyone knows toves consume a lot of memory and slithy toves are amongst the worst offenders. A typical manxome can be expected to contain a single digit number of toves, a number that in extreme cases may rise to the mid twenties. I soon got a defect complaining that if a file with many toves was imported we would encounter an OutOfMemoryException (OOME). This defect contained a screen recording of how to reproduce the defect, in it you could see a directory listing containing a file called 1000_toves.imp which the tester did not select, the directory also contained a file called 2000_toves.imp which was also not selected, the 3000_toves.imp file that was selected did indeed cause an OOME.

The problem with running out of memory is that almost anything you try to do in order to recover from this exception, does in itself, consume more memory. Even roll-back operations that will free up memory when  they are done usually consume some memory while being run. This is why the best way to deal with such an exception is not to be in that situation to begin with. The .NET framework supplies the tool needed in order to not be in that situation to begin with and it’s called MemoryFailPoint the problem I was facing was that I couldn’t find out in advance how much memory I would be consuming.

The simple solution was to define an arbitrarily limit on the number of toves I allowed a file to contain, this artificial limit went against my instincts as a programmer (Scott Meyers would call it a keyhole problem) and is the solution we ultimately chose. I would like to show another solution I explored since it may be the least worse option for someone in some bizarre situation.

The problem this method attempts to solve is of allowing the program to continue functioning after encountering the OOME while giving the strong exception guarantee (i.e. if an exception occurs the program state is as if the operation wasn’t attempted). As things stood not only was the program state changed (some but not all toves were added to the manxome) but worse, the program became unusable, we would encounter OOME after OOME. The basic idea isn’t new, it’s to put some memory aside for a rainy day. If an exception occurs we can then free this memory so that we have enough space to perform roll-back operations.

public void DoStuff(string param)
{
    try
    {
        var waste = new byte[1024 * 1024 * 100]; // set aside 100 MB
        var random = new Random();
        waste[random.Next(byte.Length)] = 1;

        DoStuffImpl(param);

        if (waste[random.Next(byte.Length)] == 1)
            Trace.WriteLine("Do something the compiler can't optimize away");
    }
    catch (OutOfMemoryException oom)
    {
        GC.Collect(); // Now `waste` is collectable, prompt the GC to collect it
        throw new InsufficientMemoryException("", oom); // Wrap OOM so it can be better handled
    }
}

A couple of notes about what’s going on here

  • I throw InsufficientMemoryException rather than re-throwing the original OOM exception to signal that the program has enough memory to continue, it’s just this operation that failed.
  • Originally there was none of this nonsense of setting a random byte but the compiler kept optimizing waste away. I think that GC.KeepAlive should also work but I didn’t think of it at the time and I no longer have the environment to check it out.

As I said this code was never put to the test so use it at your own risk and only as a last resort[2].


1. These are not the actual names of the entities.
2. I’m sensing a trend here, code I wrote that doesn’t get used seems to find its way to this blog, perhaps I should rename it to undead code.




For better mileage…

5 11 2013

In the early nineties I had a hobby, I would buy blank T-shirts and paint on them. I never had any art training (or talent) but since it was my work I was pretty fond of these shirts.

At first they were day to day shirts but they faded and stretched so I started wearing them when hiking and eventually as pyjamas.

Sadly now, twenty years after being created, these shirts are no longer even fit to be pyjamas. In retrospect I probably shouldn’t have bought the cheapest shirts available. Still, I don’t have the heart to throw them out. Perhaps by blogging about them  I will feel as if their spirit lives on and I’ll be able to let them go (into the dumpster…)

So without further ado, here’s shirt #1 For better mileage…

Front

The front of the shirt features a person I think of as either Abraham or an archetypal Bedouin he’s walking by a sign that says “Really far awayand there’s a caption underneath him saying (can you guess it?) For better mileage…

Back

On the back of the shirt you can see the same guy ridding The new (this is in very faded orange) Sheik 2000™ Stretch Camel.

As I said this was created in the early nineties when the year 2000 was still the stuff of science fiction. The sign now says “Pretty close now” and you can tell that the three-humped camel is the latest in luxury, it even has two antennae (was I thinking of mobile phones back then? I don’t remember).

Well that’s the first in my shirt series, expect some more (not too many) when I run out of actual things to say.





Press Cancel to cancel (or Cancelable Asserts)

2 10 2013

Back when I developed on Unix asserts where simple, if an assert fired the application would abort with a nice core dump which could be then debugged at leisure. When I moved to Windows development one of the changes I had to get used to was that Visual Studio’s _ASSERTs were not fatal.

Regular VS ASSERTE

At first glance this looks like an improvement, you can choose which asserts are fatal and which are not.

There is the obvious wart “Press Retry to debug” I had to read this line several hundred times before it became automatic. Still, all in all, an improvement. However the situation on one of our mega-lines-of-code projects was not so good. It’s a bit embarrassing to admit this (if others have faced the same situation please comment below, misery loves company) but the debug build of the application became very cumbersome to use. This was probably the result of having some teams work only with production builds and some with debug builds and from the fact that some flows in a specific team’s code would only happen when used in a particular way from other teams’ modules. Whatever the cause the result was that when working with the debug build you would have to press Ignore many, many times.

The sane thing to do would have been to treat an ASSERT being activated as a critical defect and either fix the calling code or, if the assert was a false positive, remove the assert. Politics got in the way of sanity and changing the code that contained the assert would often be more effort than clicking Ignore a couple of times.

Usually the situation wasn’t so dire, after all a code base does not typically contain that many asserts, after all in order for an assert to exist someone had to write it explicitly. It got worse when the (false positive) assert was in a loop or part of a recursive call, in these cases you would have to Ignore the same assert dozens if not hundreds of times. In fact it got so bad that the debug version of the application had a menu item which turned off assertions using _CrtSetReportMode(CRT_ASSERT, _CRTDBG_MODE_DEBUG).

I had a small insight, it doesn’t really matter how many times an assert fires, it should optimally not fire at all, but if it does it doesn’t matter if it’s once or a thousand times. If a false-positive assert does creep into your code, the short term goal, is to get it to stop bothering you. For this purpose I wrote the CANCLEABLE_ASSERT macro, this macro never got committed to our code base since it’s obviously not the right thing to do. The right thing to do is fix all the asserts but perhaps it would have been the pragmatic thing to do since this product eventually got to the state where almost nobody used the debug build.

I retouched the macro a bit  for the purposes of this post and here it is in all its glory (or lack thereof).

#ifdef _DEBUG
#include <windows.h>
#include <intrin.h>

#define CANCLEABLE_ASSERT(expr)              /*1*/ \
  do {                                       /*2*/ \
    static bool check(true);                 /*3*/ \
    if(check &&!(expr)) {                    /*4*/ \
      switch(MessageBoxA(NULL,                     \
      "Assertion failed: "#expr                    \
      "\n\nFile: " __FILE__                        \
      "\nLine: " _CRT_STRINGIZE(__LINE__)          \
      "\n\nDo you want to debug?"                  \
      "\n(Cancel means don't assert here again)",  \
      "Debug?", MB_YESNOCANCEL | MB_ICONHAND)){	   \
      case IDYES:                                  \
        __debugbreak()                       /*5*/ \
        break;                                     \
      case IDCANCEL:                               \
        check = false;                             \
      }                                            \
   }                                               \
} while((void)false, false)                  /*6*/
#else
#define CANCLEABLE_ASSERT(x)((void)0)
#endif

A bit of explanation for anyone interested.

  1. If this is the first time you see multi-line macros, having a backslash at the end of the line pulls the next line into the macro, this is why I used C style comments  and not C++ for numbering, otherwise the rest of the macro would also be commented out.
  2. The whole thing is wrapped by a do { } while in order to make this a single statement (here’s why).
  3. The do { } while also introduces a new scope which makes the static boolean very localized and we don’t have to worry about giving it a unique name.
  4. the condition is only evaluated if this specific assert hasn’t been canceled due to logical and short-circuiting this is a nice feature since expr may be arbitrarily costly to compute (this is also why expr is parenthesised).
  5. In case we choose to debug the code the int 3 assembly instruction breaks into the debugger, note that in assembly int stands for interrupt, not integer  Edit: Thanks to Ofek for suggesting I use __debugbreak() instead.
  6. The while condition is ugly in order to prevent compiler warnings.

Cancelable asserts

As an added benefit this dialog has more intiative buttons. Yes I want to debug, No I do not want to debug, please Cancel this breakpoint for the duration of this program.

As I mentioned above, this code did not make into our code base so I can’t vouch for it 100% but I’m hereby placing it in the public domain. I hope you find it useful (or at least mildly interesting).








Follow

Get every new post delivered to your Inbox.