Visual Studio debugger visualizer for JsonCPP

6 06 2015

We’ve been working with the JsonCPP library and while it’s very nice to use, it’s a bit of a pain to debug. For example, consider this simple JSON value:

var json = [ 
  42, 
  { 
    "name": "hello", 
    "pi": 3.1415926, 
    "valid": true 
  }
]

If you want to see the value of pi in the debugger, it would look (after some digging) something like this:

JSON with no visualizer

After enduring this for a bit too long I decided to look for a debugger visualizer for JsonCPP but couldn’t find one, so as a last resort I decided to write one myself. I have to say I was pleasantly surprised to find that this was pretty simple, and after a little work I got to the situation that my debugger window looked much more manageable:

JSON with visualizer

If you want to use this visualizer you can find it in GitHub’s visualstudio-debugger repository.





Slicing up a UTF-8 string

30 03 2015

A couple of years ago I had to deal with some low-level code that sent a UTF-8 encoded string as packets of bytes. At first I converted each packet to a string and stored the concatenation of the results, but I got a defect saying that we would sometimes get funny strings that contained a � character. I recognized the Unicode replacement character and quickly figured out that the cause was a multi-byte UTF-8 character being split between two packets, which meant it could not be correctly converted to a string. The solution was simple: just accumulate the data as bytes and only convert to a string once all the data has been received.

This memory surfaced when I performed a code review for a colleague who was facing a 1 MiB size limitation when using Chrome’s Native Messaging; his solution was to cut the message into chunks and send them one after the other.

I warned him about the danger of arbitrarily splitting a UTF-8 string without checking if you’re at a character boundary.

As mentioned in Wikipedia’s entry for UTF-8, one of the main advantages of UTF-8 is that it is backwards compatible with ASCII; all ASCII characters have the same meaning in UTF-8. Since ASCII uses 7 bits and its characters have a 0 MSB, in UTF-8 a 0 MSB denotes a single-byte character. The first byte of a multi-byte character begins with as many 1 bits as there are bytes in the character, followed by a 0 (e.g. a three-byte character will start with 1110). All the other bytes in the character (known as continuation bytes) begin with 10.

Here’s a summary table:

First bit(s)  Condition              It is a                               Rule
0             (byte & 0x80) == 0     Single-byte character                 It’s OK to cut before or after it
10            (byte & 0xC0) == 0x80  Continuation byte                     Do not cut before or after it
11            (byte & 0xC0) == 0xC0  First byte of a multi-byte character  It’s OK to cut before it, but not after it




OCD is the path to the dark side

6 01 2015

A while back I had to wrap a built in JavaScript function, this is pretty simple thanks to the fact that JavaScript is a dynamic prototype based language. Here’s an example of how this can be done (not the actual function or functionality in question):

(function wrapAddEventListener() {
  var orig = HTMLElement.prototype.addEventListener;
  function wrapper(name, handler, capture) {
    console.log("Added a handler for " + name + ' on ' + this); 
    orig.call(this, name, function(ev) { 
      console.log("Got Event " + ev.type); 
      handler(ev); 
    }, capture);
  };

  HTMLElement.prototype.addEventListener = wrapper;	
})();

The problem was that then my OCD kicked in, because now if I type document.body.addEventListener in the console I get the function’s body instead of function addEventListener() { [native code] }. For some reason this bothered me (why?) enough to add the following line to the function-wrapping code:

wrapper.toString = function() { 
    return orig.toString() 
}

Now this is deceitful and worthless since it doesn’t really achieve anything; debugging into the function will still show the wrapper code. Still, I felt that for aesthetic reasons it is preferable.

I’m not sure if covering your tracks like this is evil (since it’s deceitful) or acceptable (since it isn’t hiding any semantic changes). I’ll just hope it’s the worst of my sins for the upcoming year…





Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding, and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) I’m considered an expert (rather than barely competent, which is also an overstatement).

From the get-go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this: the CW2A classes and CComBSTR’s constructor that accepts a const char*. Both use the CP_THREAD_ACP code page when converting strings, and you cannot set the thread code page to UTF-8.

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

#include <atlbase.h>
#include <atlconv.h>
#include <string>
using std::string;

int main(int argc, char* argv[])
{
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);
	CW2A cw2a8(wide, CP_UTF8);
	string str = CW2A(wide);
	string str8 = CW2A(wide, CP_UTF8);
	CComBSTR bs(str8.c_str());
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));
}
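To see why the default code page yields a ?, it helps to look at the raw bytes: “Я” is U+042F, which UTF-8 encodes as the two bytes 0xD0 0xAF, and a single-byte ANSI code page has no slot for it. Here is a minimal sketch of the two-byte encoding rule (my own helper, handling only code points below U+0800; it is not part of the ATL example above):

```cpp
#include <string>

// Encode a code point below U+0800 as UTF-8 (one- or two-byte forms only).
std::string EncodeUtf8(unsigned cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // 0xxxxxxx
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // 110xxxxx: top 5 bits
        out += static_cast<char>(0x80 | (cp & 0x3F));  // 10xxxxxx: low 6 bits
    }
    return out;
}
```

Running this for U+042F produces exactly the 0xD0 0xAF pair that the CP_UTF8 conversions above generate.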

Our toy example gave almost the expected results:

Type         Default     CP_UTF8
CW2A         Bugs ? Us   Bugs Я Us
std::string  Bugs ? Us   Bugs Я Us
CComBSTR     Bugs Я Us   Bugs Я Us

The cells in red surprised me: surely those should contain the correct string?

Then I remembered the s8 format specifier, which instructs Visual Studio to display strings as UTF-8; perhaps the strings are correct but Visual Studio is misleading us! After adding s8 in the watch window things look marginally better. Now only the std::string differs from my expectations.

Type         Default     CP_UTF8
CW2A         Bugs ? Us   Bugs Я Us
std::string  Bugs ? Us   Bugs Я Us
CComBSTR     Bugs Я Us   Bugs Я Us

A bit more poking around showed that the reason for this is that std::string’s visualizer uses the s specifier.

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added the red 8s to the file (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>
</Type>

Now std::string, at least, defaults to a UTF-8 representation in the debugger visualizer.

watch8

You may be asking yourself why there are two lines each for DisplayString and StringView; this is because Visual C++’s string uses the Short String Optimization, which avoids dynamic allocations for short strings.

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.

hex

Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy (just grep for it), but finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would let us deprecate a method without editing VC’s headers, but since we want to deprecate a specific overload it is not an option. I had to use my administrator rights again to edit atlcomcli.h.

declspec
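The same idea can be shown on a toy class: marking just one constructor overload as deprecated leaves the others untouched. This sketch uses the standard [[deprecated]] attribute rather than __declspec, and the class and message are my own invention, not the actual atlcomcli.h edit:

```cpp
#include <string>

struct Text {
    // Suspect overload: converts using the ambient code page,
    // analogous to CComBSTR(const char*).
    [[deprecated("lossy narrow-string conversion; use the wide overload")]]
    explicit Text(const char* narrow) : value(narrow ? narrow : "") {}

    // Wide overload: not deprecated.
    explicit Text(const wchar_t*) : value("(wide)") {}

    std::string value;
};

// Text bad("hello");   // uncommenting this line emits a deprecation warning
Text good(L"hello");    // compiles silently
```

Every call site of the narrow overload now lights up with a warning while the wide overload stays quiet, which is exactly the grep-proof search we wanted.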

Now we get a warning for every use of the deprecated method and can decide whether we’ve found a lurking bug.

warning






Out of memory, out of luck?

6 01 2014

I was recently implementing a new feature that takes a user-supplied file, parses it and adds some slithy toves to the active manxome[1].

Now as I’m sure everyone knows, toves consume a lot of memory and slithy toves are amongst the worst offenders. A typical manxome can be expected to contain a single-digit number of toves, a number that in extreme cases may rise to the mid twenties. I soon got a defect complaining that if a file with many toves was imported we would encounter an OutOfMemoryException (OOME). The defect contained a screen recording of how to reproduce it: in it you could see a directory listing containing a file called 1000_toves.imp which the tester did not select, a file called 2000_toves.imp which was also not selected, and the 3000_toves.imp file that was selected and did indeed cause an OOME.

The problem with running out of memory is that almost anything you try to do in order to recover from this exception does, in itself, consume more memory. Even roll-back operations that will free up memory when they are done usually consume some memory while they run. This is why the best way to deal with such an exception is not to be in that situation to begin with. The .NET framework supplies a tool for exactly that, MemoryFailPoint; the problem I was facing was that I couldn’t find out in advance how much memory I would be consuming.

The simple solution was to define an arbitrary limit on the number of toves a file is allowed to contain. This artificial limit went against my instincts as a programmer (Scott Meyers would call it a keyhole problem) but it is the solution we ultimately chose. Still, I would like to show another solution I explored, since it may be the least-bad option for someone in some bizarre situation.

The problem this method attempts to solve is allowing the program to continue functioning after encountering the OOME while giving the strong exception guarantee (i.e. if an exception occurs, the program state is as if the operation had never been attempted). As things stood, not only was the program state changed (some but not all toves were added to the manxome) but, worse, the program became unusable; we would encounter OOME after OOME. The basic idea isn’t new: put some memory aside for a rainy day. If an exception occurs we can then free this memory so that we have enough space to perform roll-back operations.

public void DoStuff(string param)
{
    try
    {
        var waste = new byte[1024 * 1024 * 100]; // set aside 100 MB
        var random = new Random();
        waste[random.Next(waste.Length)] = 1; // touch a random byte so the allocation isn't optimized away

        DoStuffImpl(param);

        if (waste[random.Next(waste.Length)] == 1)
            Trace.WriteLine("Do something the compiler can't optimize away");
    }
    catch (OutOfMemoryException oom)
    {
        GC.Collect(); // Now `waste` is collectable, prompt the GC to collect it
        throw new InsufficientMemoryException("", oom); // Wrap OOM so it can be better handled
    }
}

A couple of notes about what’s going on here:

  • I throw InsufficientMemoryException rather than re-throwing the original OOM exception to signal that the program has enough memory to continue, it’s just this operation that failed.
  • Originally there was none of this nonsense of setting a random byte, but the compiler kept optimizing waste away. I think GC.KeepAlive should also work, but I didn’t think of it at the time and I no longer have the environment to check it out.

As I said this code was never put to the test so use it at your own risk and only as a last resort[2].


1. These are not the actual names of the entities.
2. I’m sensing a trend here, code I wrote that doesn’t get used seems to find its way to this blog, perhaps I should rename it to undead code.




Press Cancel to cancel (or Cancelable Asserts)

2 10 2013

Back when I developed on Unix, asserts were simple: if an assert fired, the application would abort with a nice core dump which could then be debugged at leisure. When I moved to Windows development, one of the changes I had to get used to was that Visual Studio’s _ASSERTs were not fatal.

Regular VS ASSERTE

At first glance this looks like an improvement: you can choose which asserts are fatal and which are not.

There is the obvious wart of “Press Retry to debug”; I had to read this line several hundred times before it became automatic. Still, all in all, an improvement. However, the situation on one of our mega-lines-of-code projects was not so good. It’s a bit embarrassing to admit this (if others have faced the same situation please comment below, misery loves company), but the debug build of the application became very cumbersome to use. This was probably the result of having some teams work only with production builds and some with debug builds, and of the fact that some flows in a specific team’s code would only happen when it was used in a particular way from other teams’ modules. Whatever the cause, the result was that when working with the debug build you would have to press Ignore many, many times.

The sane thing to do would have been to treat an activated assert as a critical defect and either fix the calling code or, if the assert was a false positive, remove the assert. Politics got in the way of sanity, and changing the code that contained the assert would often be more effort than clicking Ignore a couple of times.

Usually the situation wasn’t so dire; a code base does not typically contain that many asserts, since for an assert to exist someone had to write it explicitly. It got worse when a (false positive) assert was in a loop or part of a recursive call; in those cases you would have to press Ignore for the same assert dozens if not hundreds of times. In fact it got so bad that the debug version of the application had a menu item which turned off assertions using _CrtSetReportMode(CRT_ASSERT, _CRTDBG_MODE_DEBUG).

I had a small insight: it doesn’t really matter how many times an assert fires; optimally it should not fire at all, but if it does, it doesn’t matter whether it’s once or a thousand times. If a false-positive assert does creep into your code, the short-term goal is to get it to stop bothering you. For this purpose I wrote the CANCLEABLE_ASSERT macro. This macro never got committed to our code base since it’s obviously not the right thing to do (the right thing is to fix all the asserts), but perhaps it would have been the pragmatic thing to do, since this product eventually got to the state where almost nobody used the debug build.

I retouched the macro a bit for the purposes of this post and here it is in all its glory (or lack thereof).

#ifdef _DEBUG
#include <windows.h>
#include <intrin.h>

#define CANCLEABLE_ASSERT(expr)              /*1*/ \
  do {                                       /*2*/ \
    static bool check(true);                 /*3*/ \
    if(check &&!(expr)) {                    /*4*/ \
      switch(MessageBoxA(NULL,                     \
      "Assertion failed: "#expr                    \
      "\n\nFile: " __FILE__                        \
      "\nLine: " _CRT_STRINGIZE(__LINE__)          \
      "\n\nDo you want to debug?"                  \
      "\n(Cancel means don't assert here again)",  \
      "Debug?", MB_YESNOCANCEL | MB_ICONHAND)){	   \
      case IDYES:                                  \
        __debugbreak();                      /*5*/ \
        break;                                     \
      case IDCANCEL:                               \
        check = false;                             \
      }                                            \
   }                                               \
} while((void)false, false)                  /*6*/
#else
#define CANCLEABLE_ASSERT(x)((void)0)
#endif

A bit of explanation for anyone interested:

  1. If this is the first time you’ve seen multi-line macros: a backslash at the end of a line pulls the next line into the macro. This is why I used C-style comments and not C++ ones for numbering; otherwise the rest of the macro would also be commented out.
  2. The whole thing is wrapped by a do { } while in order to make this a single statement (here’s why).
  3. The do { } while also introduces a new scope which makes the static boolean very localized and we don’t have to worry about giving it a unique name.
  4. The condition is only evaluated if this specific assert hasn’t been canceled, thanks to logical AND short-circuiting. This is a nice feature since expr may be arbitrarily costly to compute (it is also why expr is parenthesised).
  5. If we choose to debug, __debugbreak() breaks into the debugger; it emits the int 3 assembly instruction (note that in assembly int stands for interrupt, not integer). Edit: Thanks to Ofek for suggesting I use __debugbreak() instead.
  6. The while condition is ugly in order to prevent compiler warnings.
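Note 2 deserves a standalone illustration: wrapping a multi-statement macro in do { } while makes the expansion a single statement, so it behaves correctly as the unbraced body of an if. A toy version (my own macro and function names, not CANCLEABLE_ASSERT itself):

```cpp
// Without do/while the second statement escapes the if below.
#define LOG_TWO_BAD(a, b) record(a); record(b)

// With do/while the expansion is one statement that consumes a trailing ';'.
#define LOG_TWO_GOOD(a, b) do { record(a); record(b); } while ((void)0, false)

int g_count = 0;
void record(int) { ++g_count; }

void demo(bool flag) {
    // if (flag) LOG_TWO_BAD(1, 2);  // bug: record(2) would run even when flag is false
    if (flag)
        LOG_TWO_GOOD(1, 2);          // both calls correctly guarded by the if
}
```

With the do/while form, demo(false) records nothing and demo(true) records both calls; the bad form would record once even when flag is false.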

Cancelable asserts

As an added benefit this dialog has more intuitive buttons: Yes, I want to debug; No, I do not want to debug; please Cancel this breakpoint for the duration of this program.

As I mentioned above, this code did not make it into our code base so I can’t vouch for it 100%, but I’m hereby placing it in the public domain. I hope you find it useful (or at least mildly interesting).





Malkovich? Malkovich Malkovich!

15 07 2013

If you have ever logged in to a Windows computer that uses the Japanese display language, you probably saw that the path separator is not the common backslash (\) but the Yen symbol (¥).

Well, at least that was what I thought. I recently got a defect saying that we hadn’t localized one of our dialogs correctly and it showed a backslash on Japanese OSs. My first thought was that somebody had hardcoded ‘\’ instead of Path.DirectorySeparatorChar, but a quick look at the code showed that we were using the path as supplied by the OS. This forced me to learn something new, which I will now inflict on you.

Since reading Joel’s classic The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), my understanding was that the Unicode range [0-127] (aka ANSI or ASCII) was the same the world over. Quote:

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up…

As appropriate for a post describing the “absolute minimum”, this is not the whole story. Michael Kaplan’s post when is a backslash not a backslash taught me that on Japanese OSs a backslash (Unicode U+005C, which is 92, less than 128) is displayed as a Yen sign (¥), even though it is not the Unicode Yen character (U+00A5). This means that the path separator is still a backslash; it is only displayed as a Yen. It also means that the actual Yen character is not a path separator, so it can be used in file names, and the following path can therefore refer to several different files:

C:¥¥¥¥¥¥¥¥¥¥¥¥¥.¥

The first ¥ must actually be a backslash (and the second can’t be one), which means that the file in question may be any of the following:

c:\¥¥¥¥¥¥¥¥\¥¥¥.¥
c:\¥¥¥\¥¥¥¥\¥¥¥.¥
c:\¥¥\¥¥\¥¥\¥¥¥.¥
c:\¥¥¥¥¥¥¥¥\¥¥¥.¥
c:\¥\¥\¥\¥\¥¥¥¥.¥
... and many more


The same story applies to the Korean Won sign (₩).
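Since the separator is U+005C at the byte level no matter how a font draws it, splitting a path only at actual backslash bytes leaves genuine Yen signs alone. A toy sketch (my own helper, assuming the path is UTF-8, where the real Yen sign U+00A5 encodes as 0xC2 0xA5 and so never matches the separator byte):

```cpp
#include <string>
#include <vector>

// Split a UTF-8 path at real backslashes (byte 0x5C).
// The genuine Yen sign U+00A5 is the bytes 0xC2 0xA5 in UTF-8, never 0x5C.
std::vector<std::string> SplitPath(const std::string& path) {
    std::vector<std::string> parts;
    std::string current;
    for (char c : path) {
        if (c == '\\') {
            parts.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    parts.push_back(current);
    return parts;
}
```

So a path like c:\¥¥\f.txt splits into three components, and the ¥¥ component survives intact even though a Japanese font would draw all three marks identically.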

tl;dr: how a backslash appears depends on the font you use; the path separator is not localized on Japanese OSs.


A more topical title for this post would probably be Hodor hodor hodor.