Bug war story

22 02 2010

Every developer seems to have a war story to tell the new recruits around the camp fire espresso machine, well this is mine. It all happened about 9 years ago and most of the memories are still fresh in my mind (post-trauma is like that) but some details are a bit foggy and have been reconstructed to the best of my ability.

My first job out of university was at Mainsoft, their flagship product at the time allowed you to compile your Win32 C++ code to Unix.  Mainsoft has this gift of taking  extremely talented people  (present company excluded of course) in order to create awesome technical solutions to obscure problems which nobody would ever want to pay for.

Getting MS Visual C++ code to compile natively on Unix is not a simple task (Microsoft love adding non-standard extensions to C++) and when you do get it to compile (and link) you have to supply all of the Win32 API so the damn thing will actually run.

One of my first tasks was to re-implement the registry as a service, up till then the registry was implemented as a file mapped to shared memory. This had the unfortunate implication that if the customer’s code (or ours) had a bug which crashed the program or wrote to a dangling pointer, the registry would be corrupted. My job was to write a daemon which handled all the registry functions and replace the implementation in advapi.dll with stubs that call the registry service.  A lot of the work involved marshalling the data, you can’t depend on the types of parameters, is BYTE* a pointer to one BYTE or to an array? If it’s an array what’s its length? Luckily Microsoft document whether each parameter is in, out or both, whether pointers may be NULL and how to determine lengths of buffers. I built a pretty nifty framework with template classes for each of these eventualities so that data was only transmitted between processes when needed.

Simulation of regedit on Linux

Simulation of regedit on Linux

As I said I was fresh out of university and one of the few things we learnt in Software Engineering (most of the curriculum was Computer Science) was to “build one to throw away” (I think they called it prototyping), so I wrote a prototype with lots of printf statements[1]. When it was done I was told that rather than throw it away and start from scratch I should get it to work on the other supported Unices. Instead of removing all the printfs (and thinking they may be useful in the future), I used dup2[2] to map STDOUT to /dev/null.

When the project was almost done a new feature was added, create backups for the registry state, I don’t know if there was any actual need for this but it was easy enough to do so we added it. There was a small hiccough when I tried to find a Unix system call for copying files, but then I remembered that we were implementing Win32’s API and I could use CopyFile rather than re-implement it.

A few days later I got a bug report from our QA about application crashes when using the registry APIs, nothing consistent but it only happens in the latest version of the registry server I supplied. This made no sense, the only change I made was backing up the registry files (in the daemon) and the crashes are happening in the registry client processes. By attaching to a code dump I can see that the problem occurred when a string was returned as an out parameter, first the string length is transmitted, then the client allocates enough space to store it and copies it over. The problem is that the length received was astronomical, something on the order of 2 gigabytes! (remember this is circa 2001).  The allocation failed and the crash wasn’t far away.

After a few days of trying to reproduce the problem (with intermittent success) QA managed to find a reproducible scenario. This scenario involved deleting  the user’s registry file (if there is no registry present we copy over an initial registry file) then an automated test was run and 20 minutes later the client application crashed. The crash only happened when in the optimized build of the the registry server (although it didn’t matter if the client was built in debug or optimized) and it always happened in the same registry function call. I put a breakpoint in the server and saw that we received the call and gave the correct answer in response, then I noticed something weird, the client app got the bad answer before we sent the correct one.

I was officially baffled and I called a more experienced developer in for a consult. He had a look at the length received, 1,702,129,253, printed it out in hex 0x65746E65 and said:

– “this looks like a string.”

– “huh” said I intelligently.

– “Look, every two bytes are around 0x60 just like ASCII characters.”

We cast the address of the number to char* and got “entering function foobar()”, a quick grep later we realized that this was one of my printfed lines.

But how was printf getting transmitted across processes? That question turned out to be easy to answer, the file descriptor accept returned for the new connection was 1, now everybody knows[citation needed] that 1 is STDOUT so printf (which writes to STDOUT) was being sent over the socket. But why was accept returning 1? It wasn’t a free descriptor!

Or was it?

By using truss (that’s strace in Linux) we found that someone was indeed calling close on 1 and the culprit was none other than CopyFile, when the registry server was launched it backed up the current registry file, however this file didn’t exist. When asked to copy a non-existant file CopyFile should fail gracefully however this is what our implementation of CopyFile looked like:

bool CopyFile(const char* from, const char* to, bool failIfExists)
    int fdFrom, fdTo;
    bool success = false;
    fdFrom = open(from, O_RDONLY);
    if (fdFrom == -1)
        goto End;
    int openFlag = O_WRONLY;
    if (failIfExist)
        openFlag = openFlag | O_EXCL;

    fdTo = open(from, openFlag);
    if (fdTo == -1)
        goto End;
/* Actual work done here */
    return success;

When from didn’t exist a <gasp>goto</gasp> was used, to skip over fdTo‘s initialization, so close was called on an uninitialized variable which just happened to be 1[3], nobody told printf that the file descriptor it was writing to was closed and it went on failing merrily. At some point down the line accept saw that 1 was available and returned it causing the subsequent printf‘s to succeed (although not exactly as intended).

Lessons Learnt

These lessons seem trivial now but as a new developer this bug really drove them home.

  • Always compile with warnings treated as errors (-Wall -Werror)
    • Always initialize your variables, C++ isn’t C, (nowadays even C isn’t C) you can declare a variable anywhere so declare it when you can give it a value
  • Bits are bits are bits, if something has a value that doesn’t make sense try to look at it another way. The first part of most objects is the virtual pointer, if that doesn’t point into the memory space of your DLL then you’re probably not looking at an object after all.
  • goto really should be considered harmful, this was the first time I saw goto in real code and it bit me, hard. I have seen benevolent goto’s but as a rule they should be avoided
  • Last but not least, there is no such thing as a small feature, even the most trivial change can have astonishing outcomes

Reminiscing about this bug also made me thing about how it relates to Linus’ Law:  all bugs are shallow. On the one hand this bug was extremely shallow (it could have been caught by the compiler), on the other hand its shallowness was confined to a specific function, far, far away from where it manifested. For all Mainsoft’s customer’s CopyFile is a black box, for them no number of eyeballs on the code would have discovered that trying to copy a non-existent file would cause STDOUT to be closed. On the gripping hand if Linus’ law is only relevant when the whole stack you’re dealing with is open source then it’s close to useless, how many opportunities are there to work on such a stack where even the hardware is open source?

[1] At the time Solaris acted up when iostreams were used in shared object files (.so, DLLs in WinSpeak).

[2] I can’t believe I still remember this function after seven years of not working on Unix. 

[3] In the optimized build to wasn’t initialized, in the debug build it was initialized to zero, apparently closing STDIN didn’t bother anyone.




5 responses

10 03 2010

lol i work on a product that uses the bastard stepchild that is mainwin, and trust me, it hasnt gotten better in these 10 years

12 03 2010

goto really should be considered harmful, this was the first time I saw goto in real code and it bit me, hard.

So you’ve seen a broken goto once (and being the first time you saw it in real code, you obviously not have read enough serious code — Linux comes to mind), probably also somebody has told you goto is BAD once or twice (or maybe you have even read a paper from a well known computer scientists, without even reading rebutals from other well known computer scientists), and your conclusion is that goto should be avoided as a rule ?

14 03 2010
Motti Lanzkron

I’ve seen a broken goto once in my first year of work and many non-broken ones since then.
I didn’t say that goto should never be used rather that they should be avoided. In my 10 years of developing I’ve never run across the necessity of using a goto and I try to avoid doing so.

My view is that the “rules” are there to warn the novices and make the experts think twice however experts should also know when to break the “rules”.


17 06 2010

This isn’t goto’s fault. The non-goto method is to use an if statement to skip over the body and it has the exact same effect.

Also, “with intermediate success” should (presumably) be “with intermittent success”.

17 06 2010
Motti Lanzkron

Thanks for catching the typo (fixed), as for your point about if having the same problem; I disagree, since if have a code block associated with them it’s harder to miss initializations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: