Off by One Error

6 02 2011

I’m proud to present the latest instalment of indoctrinating my son to be a programmer.

I was giving the boys a bath when #1 said to #2 “you have 14 toes“, I told him his count is a bit off and he replied: “Fooled you, I counted one toe four times” (what can I say? There’s a reason I’m aiming for programmer not stand-up comedian). I then asked him how many toes he would have counted had he really counted the same one four times and was pleasantly surprised when he immediately answered “13“.

Now anyone can make an OBOE but recognising it at once (before the age of 7), well that shows potential.


Bug war story

22 02 2010

Every developer seems to have a war story to tell the new recruits around the camp fire espresso machine, well this is mine. It all happened about 9 years ago and most of the memories are still fresh in my mind (post-trauma is like that) but some details are a bit foggy and have been reconstructed to the best of my ability.

My first job out of university was at Mainsoft, their flagship product at the time allowed you to compile your Win32 C++ code to Unix.  Mainsoft has this gift of taking  extremely talented people  (present company excluded of course) in order to create awesome technical solutions to obscure problems which nobody would ever want to pay for.

Getting MS Visual C++ code to compile natively on Unix is not a simple task (Microsoft love adding non-standard extensions to C++) and when you do get it to compile (and link) you have to supply all of the Win32 API so the damn thing will actually run.

One of my first tasks was to re-implement the registry as a service, up till then the registry was implemented as a file mapped to shared memory. This had the unfortunate implication that if the customer’s code (or ours) had a bug which crashed the program or wrote to a dangling pointer, the registry would be corrupted. My job was to write a daemon which handled all the registry functions and replace the implementation in advapi.dll with stubs that call the registry service.  A lot of the work involved marshalling the data, you can’t depend on the types of parameters, is BYTE* a pointer to one BYTE or to an array? If it’s an array what’s its length? Luckily Microsoft document whether each parameter is in, out or both, whether pointers may be NULL and how to determine lengths of buffers. I built a pretty nifty framework with template classes for each of these eventualities so that data was only transmitted between processes when needed.

Simulation of regedit on Linux

Simulation of regedit on Linux

As I said I was fresh out of university and one of the few things we learnt in Software Engineering (most of the curriculum was Computer Science) was to “build one to throw away” (I think they called it prototyping), so I wrote a prototype with lots of printf statements[1]. When it was done I was told that rather than throw it away and start from scratch I should get it to work on the other supported Unices. Instead of removing all the printfs (and thinking they may be useful in the future), I used dup2[2] to map STDOUT to /dev/null.

When the project was almost done a new feature was added, create backups for the registry state, I don’t know if there was any actual need for this but it was easy enough to do so we added it. There was a small hiccough when I tried to find a Unix system call for copying files, but then I remembered that we were implementing Win32’s API and I could use CopyFile rather than re-implement it.

A few days later I got a bug report from our QA about application crashes when using the registry APIs, nothing consistent but it only happens in the latest version of the registry server I supplied. This made no sense, the only change I made was backing up the registry files (in the daemon) and the crashes are happening in the registry client processes. By attaching to a code dump I can see that the problem occurred when a string was returned as an out parameter, first the string length is transmitted, then the client allocates enough space to store it and copies it over. The problem is that the length received was astronomical, something on the order of 2 gigabytes! (remember this is circa 2001).  The allocation failed and the crash wasn’t far away.

After a few days of trying to reproduce the problem (with intermittent success) QA managed to find a reproducible scenario. This scenario involved deleting  the user’s registry file (if there is no registry present we copy over an initial registry file) then an automated test was run and 20 minutes later the client application crashed. The crash only happened when in the optimized build of the the registry server (although it didn’t matter if the client was built in debug or optimized) and it always happened in the same registry function call. I put a breakpoint in the server and saw that we received the call and gave the correct answer in response, then I noticed something weird, the client app got the bad answer before we sent the correct one.

I was officially baffled and I called a more experienced developer in for a consult. He had a look at the length received, 1,702,129,253, printed it out in hex 0x65746E65 and said:

– “this looks like a string.”

– “huh” said I intelligently.

– “Look, every two bytes are around 0x60 just like ASCII characters.”

We cast the address of the number to char* and got “entering function foobar()”, a quick grep later we realized that this was one of my printfed lines.

But how was printf getting transmitted across processes? That question turned out to be easy to answer, the file descriptor accept returned for the new connection was 1, now everybody knows[citation needed] that 1 is STDOUT so printf (which writes to STDOUT) was being sent over the socket. But why was accept returning 1? It wasn’t a free descriptor!

Or was it?

By using truss (that’s strace in Linux) we found that someone was indeed calling close on 1 and the culprit was none other than CopyFile, when the registry server was launched it backed up the current registry file, however this file didn’t exist. When asked to copy a non-existant file CopyFile should fail gracefully however this is what our implementation of CopyFile looked like:

bool CopyFile(const char* from, const char* to, bool failIfExists)
    int fdFrom, fdTo;
    bool success = false;
    fdFrom = open(from, O_RDONLY);
    if (fdFrom == -1)
        goto End;
    int openFlag = O_WRONLY;
    if (failIfExist)
        openFlag = openFlag | O_EXCL;

    fdTo = open(from, openFlag);
    if (fdTo == -1)
        goto End;
/* Actual work done here */
    return success;

When from didn’t exist a <gasp>goto</gasp> was used, to skip over fdTo‘s initialization, so close was called on an uninitialized variable which just happened to be 1[3], nobody told printf that the file descriptor it was writing to was closed and it went on failing merrily. At some point down the line accept saw that 1 was available and returned it causing the subsequent printf‘s to succeed (although not exactly as intended).

Lessons Learnt

These lessons seem trivial now but as a new developer this bug really drove them home.

  • Always compile with warnings treated as errors (-Wall -Werror)
    • Always initialize your variables, C++ isn’t C, (nowadays even C isn’t C) you can declare a variable anywhere so declare it when you can give it a value
  • Bits are bits are bits, if something has a value that doesn’t make sense try to look at it another way. The first part of most objects is the virtual pointer, if that doesn’t point into the memory space of your DLL then you’re probably not looking at an object after all.
  • goto really should be considered harmful, this was the first time I saw goto in real code and it bit me, hard. I have seen benevolent goto’s but as a rule they should be avoided
  • Last but not least, there is no such thing as a small feature, even the most trivial change can have astonishing outcomes

Reminiscing about this bug also made me thing about how it relates to Linus’ Law:  all bugs are shallow. On the one hand this bug was extremely shallow (it could have been caught by the compiler), on the other hand its shallowness was confined to a specific function, far, far away from where it manifested. For all Mainsoft’s customer’s CopyFile is a black box, for them no number of eyeballs on the code would have discovered that trying to copy a non-existent file would cause STDOUT to be closed. On the gripping hand if Linus’ law is only relevant when the whole stack you’re dealing with is open source then it’s close to useless, how many opportunities are there to work on such a stack where even the hardware is open source?

[1] At the time Solaris acted up when iostreams were used in shared object files (.so, DLLs in WinSpeak).

[2] I can’t believe I still remember this function after seven years of not working on Unix. 

[3] In the optimized build to wasn’t initialized, in the debug build it was initialized to zero, apparently closing STDIN didn’t bother anyone.

The First Bug

27 10 2009

I don’t remember the first bug I ever made but as everybody knows [citation needed] half the fun of parenting is living vicariously through our children.

The other day my first born (age 5¾) decided to make a board game.

He had already made the dice (that’s another story), so he got out his crayons and went to work.

Buggy board game

Buggy board game

The rules were pretty simple

  • Boat: move one step forward
  • Guard tower: lose two turns
  • Torch: move back one step less than the last die throw
  • Cannon (that’s the one with lots of red circles): go back to the beginning
  • Etc.

We were playing merrily along when the configuration in the picture came up, two steps away from the boat, roll the die and…

  1. The die lands on 2
  2. Move two steps to boat
  3. Boat moves us a step forward
  4. Land on torch
  5. Torch moves us back one step less than was rolled
  6. Two was rolled so go back one step
  7. Land on boat (goto 2)

Whoops, we’re in an infinite loop! FB was quick to realize what was going on and quickly stopped his old man from executing the infinite loop (ruining all my fun), we arbitrarily chose which of the two locations to stay on an kept on playing.

What really made my day about this was that this was a non-trivial bug, any non-programmer (which includes most 5 year-olds) would not think to look for an infinite loop and even I who suspected such a thing might occur didn’t really see it before it happened (in my defense I wasn’t looking very hard), most flows didn’t expose this “bug” and it’s only by chance that we ran into it on the first time we played the game.

So anyway that’s my son’s first bug, what was yours?