The code/data duality

6 01 2016

One often wants special code to run the first time a function is invoked. In C++ the naïve way to do this is using function-static variables.

Consider a function that returns a random number. For nostalgia’s sake we want to use rand (which needs to be seeded via srand before the first use). We may do something like this:

#include <cstdlib>
#include <ctime>

int random(int max) {
    static bool initialized = false;
    if (!initialized) {
        srand(time(nullptr)); // seed the random generator once
        initialized = true;
    }
    return rand() % (max + 1);
}


When I was in university, towards the end of the previous millennium, we had to learn VAX assembly (not the most useful skill I’ve ever picked up). That was when I was introduced to the concept of self-modifying code. The crux of the matter is pretty obvious in hindsight: code and data aren’t different things; in the end everything in software is bits and bytes. This means that we can modify the code of the program as simply as we modify variables.

I don’t remember my assembly and I’m assuming that anyone reading this either:

  1. Doesn’t know/remember assembly
  2. Doesn’t need me to explain about code re-writing

So I’ll just invent a simple stack based assembly language since I can’t be arsed to re-learn assembly.

If we take the random function above and translate it to pseudo-assembly we would get something like:

# randomInitialized
0x000DAF00:  0x0  # initialized to zero at compile time
# random
0x000DAF04: loadAddress randomInitialized
0x000DAF08: branchNotZero mainFlow
0x000DAF0C: load 0 # first run
0x000DAF10: call time
0x000DAF14: load 1
0x000DAF18: storeAddress randomInitialized
0x000DAF1C: call srand
# mainFlow
0x000DAF20: increment # max is currently on the stack
0x000DAF24: call rand
0x000DAF28: modulo # rand() % (max + 1)
0x000DAF2C: return

This is a pretty straightforward mapping of the C[++] code to assembly and it has the same drawbacks. We perform a comparison every time the function is run (!initialized) although it almost always has the same outcome. Another deficiency of the code is more pronounced in the assembly version: a lot of code is skipped over for most calls, which works against the instruction cache.

What we would really want is that every time the function is called, except for the first time, it just does what needs to be done. This can be achieved by modifying the code.

We compile the function to start with a jump (aka goto) to some address (outside the function) which calls srand and then replaces the jump with the first instruction we want for the subsequent function calls. The first instruction we want is increment; for the sake of argument we’ll say that the opcode for increment is 0xADD1.

# initializeRandom
0x000DAF00: load 0
0x000DAF04: call time
0x000DAF08: call srand
0x000DAF0C: load 0xADD1 # opcode(increment)
0x000DAF10: storeAddress random
# random
0x000DAF14: jump initializeRandom
0x000DAF18: call rand
0x000DAF1C: modulo # rand() % (max + 1)
0x000DAF20: return

During the first run the first thing we do is jump out to a memory location that precedes the function proper, then we seed the random number generator and modify the beginning of the function from being a jump to being an increment [1]. The initial jump is computed in advance so that we store the increment just after the instruction pointer and then just fall through into the (now) modified function. Subsequent calls to the function have a lean four-instruction function to execute with no conditions and no branches.

0x000DAF14: increment # this is now the first opcode
0x000DAF18: call rand
0x000DAF1C: modulo # rand() % (max + 1)
0x000DAF20: return

There’s no need to waste space on a static variable; the code is more cache friendly and (at least in my made up assembly) smaller.

So if everything is so good why isn’t this used in practice? OK so high level languages don’t give you direct access to the code parts of your program but the compiler could generate such code, right?

Well I understand next to nothing about compilers but I’m pretty sure that multiple cache levels at least will make things impractical (not to mention branch prediction).

The most obvious problem with this example is that it’s horrendously thread unsafe. I suppose that some of the readers have been tearing out their hair from the get go: the original C++ function with the static variable was just as unsafe. An unprotected shared variable could be modified by different threads simultaneously, which is a data race and undefined behavior (starting with C++11).

As an aside I would like to mention that C++11 introduced “thread-safe function local static initialization” so a better implementation of random would be:

#include <cstdlib>
#include <ctime>

int random(int max) {
    static bool unused = ([] { // define and invoke a lambda that
        srand(time(nullptr));  // seeds the random generator
        return false;
    })();
    return rand() % (max + 1);
}

Here I’m depending on the fact that it’s the compiler’s responsibility to initialize a static variable only once, in a thread-safe way. The static variable here is a bool but it’s never really used; all we need is the side effect of creating it (does anyone want to submit a proposal to allow static void variables?).

In most mainstream compiled languages, the code doesn’t have access to the generated machine code. Thus most programmers nowadays have a mental divide between code and data (Lisp programmers, feel free to gloat now). However JavaScript, as a scripting language with a functional orientation, brings code and data back together again.
Since functions (code) are objects, self-mutating code is back in business. As luck would have it, JavaScript is (mostly) single threaded, which allows code to mutate in a thread-safe way.
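
As a rough illustration of the idea, here is a minimal sketch of a JavaScript function that replaces itself after its first call, in the spirit of the rewritten assembly above. The names random and setupRuns are made up for this example, and the counter merely stands in for real one-time work such as seeding.

```javascript
let setupRuns = 0; // counts how often the one-time setup executes

// First version of the function: performs the setup, then replaces itself.
let random = function (max) {
    setupRuns += 1; // stand-in for one-time work such as seeding
    // From now on, calls go straight to the lean version.
    random = function (max) {
        return Math.floor(Math.random() * (max + 1));
    };
    return random(max);
};

random(10);
random(10);
random(10);
console.log(setupRuns); // 1
```

Note that this only works for callers that look up random by name on every call; anyone who saved a reference to the first version would keep re-running the setup.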

Consider, for example, a wrapper around a WebSocket.
After creating a WebSocket it’s not usable until the connection is established. Due to the single threaded nature of JavaScript this means that you have to relinquish control of the thread before using the object. Say we want to store all outgoing messages until the socket is opened and then send them. One way to achieve this would be:

function Socket(address) {
    this.socket = new WebSocket(address);
    var queue = []; // captured by 'send' and 'onopen'
    this.send = function (message) {
        if (this.socket.readyState !== WebSocket.OPEN)
            queue.push(message); // not open yet, keep for later
        else
            this.socket.send(message);
    };
    this.socket.onopen = function () {
        // send queued messages ('this' is the WebSocket here)
        queue.forEach(msg => this.send(msg));
    };
}

Now let’s see how the same thing could be achieved with self-modifying code:

function Socket(address) {
    this.socket = new WebSocket(address);
    var queue = []; // captured by 'send' and 'onopen'
    this.send = message => queue.push(message);
    var self = this;
    this.socket.onopen = function () {
        // send queued messages ('this' is the WebSocket here)
        queue.forEach(msg => this.send(msg));
        // replace the 'send' function
        self.send = function (message) {
            self.socket.send(message);
        };
    };
}

The send function now has two instances, before the socket is fully open and after it is open. But wait, what about after the socket is closed? For some reason sending on a closed socket outputs an error to the console but does not throw an exception. We can modify this behaviour like this:

this.socket.onclose = function () {
    // replace the 'send' function yet again
    self.send = message => {
        throw Error('Sending on closed socket: ' + message);
    };
};

Now we see that code can be more complex than having two different states; it can be a fully fledged state machine. Admittedly, for most code it is a state machine with only one state, since no input to the code modifies the code itself. However it may be useful to keep in mind that the code itself can model the problem space in addition to the data and data structures.
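
To see the three states in one place, here is a minimal sketch that can run outside a browser. QueuedSender is a hypothetical name, and the stub object is an assumption standing in for a real WebSocket: all it needs is a send method and assignable onopen/onclose handlers.

```javascript
// `socket` is any object with a send(message) method and assignable
// onopen/onclose handlers, e.g. a WebSocket or the stub used below.
function QueuedSender(socket) {
    const queue = [];
    // State 1 (connecting): remember messages for later.
    this.send = message => { queue.push(message); };

    socket.onopen = () => {
        // State 2 (open): flush the backlog, then send directly.
        queue.forEach(message => socket.send(message));
        queue.length = 0;
        this.send = message => { socket.send(message); };
    };

    socket.onclose = () => {
        // State 3 (closed): refuse to send.
        this.send = message => {
            throw new Error('Sending on closed socket: ' + message);
        };
    };
}

// A stub standing in for a real WebSocket, so the sketch runs anywhere.
const sent = [];
const stub = { send: message => sent.push(message) };
const sender = new QueuedSender(stub);

sender.send('a'); // queued, nothing sent yet
stub.onopen();    // flushes 'a'
sender.send('b'); // sent immediately
stub.onclose();   // from here on sender.send throws
```

Each transition is just an assignment to send, so there is no state variable to check on every call.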

1. In real life the instructions aren’t necessarily the same length but you get the idea

Work blog post – Container objects in UFT

5 11 2015

As both my regular readers have probably noticed (hi Mum! [who am I kidding, even my mother doesn’t read this]) the frequency of my posts has gone down from about once a month in the beginning to closer to twice a year. This is a combination of not having anything interesting to say and life not leaving me with much time to say uninteresting stuff.

Well at work they have funny standards for what can be considered interesting and one such subject is adding container test objects to web tests in UFT.

image from the post

It even has images!

This feature was actually implemented in UFT 12.02 (released last year) but there wasn’t enough time to sufficiently QA it and therefore the feature was undocumented and only used by a few beta customers. Starting with UFT 12.50 it’s finally an official, documented feature which is the first step to benefiting our customers. Step two is to have someone actually use it which is where the blog post comes in…

Visual Studio debugger visualizer for JsonCPP

6 06 2015

We’ve been working with the JsonCPP library and while it’s very nice to use, it’s a bit of a pain to debug. For example consider this simple JSON value:

var json = {
    "name": "hello",
    "pi": 3.1415926,
    "valid": true
};

If you want to see the value of pi in the debugger it would look (after some digging) something like this:

JSON with no visualizer

After enduring this for a bit too long I decided to look for a debugger visualizer for JsonCPP but couldn’t find one. So as a last resort I decided to write one myself. I have to say that I was pleasantly surprised to find that this was pretty simple and after a little work I got to the situation that my debugger window looked much more manageable:

JSON with visualizer

If you want to use this visualizer you can find it in GitHub’s visualstudio-debugger repository.

Cactus shirt

17 05 2015

This is the second post in my shirts series. In my first post I wrote about the hobby I had twenty years ago: drawing on shirts. Since they have started falling to pieces and I can’t make myself throw them out I decided to write about them so that they can live on in digital form and I can reclaim some wardrobe space.

This is one of my first shirts so there are a few unforced errors which I attempted to cover up.

The Front


On the front I have my Abrahamic character drinking from a decapitated cactus with the caption “Enjoy Cactus Cola, the Sheik’s thing”. This is a play on Coca Cola’s slogan and the similarity between Sheik and Chic. Since this is a cactus the sheik’s hand is obviously bleeding. One of the reasons I’ve stopped wearing this shirt at home is that my children find the blood very disturbing and can’t help but comment about it whenever they see the shirt.

I then got a stain in the centre of the shirt and had to cover it up. I chose a skull and crossbones with the warning:

Use of this product may prove hazardous to haemophiliacs

The Back


The back of the shirt is a bit of a hodgepodge of desert-based jokes.

I have the character from the front of the shirt crawling towards a mirage. You’ll notice that he has a star and crescent armband; this is to cover up my second error, where I initially drew the arm as some kind of Möbius strip.

Next to the mirage is the skeleton of a fish (complete with the skeleton of bubbles coming out of its mouth). The idea was that the fish came to live in the mirage and died since there wasn’t really any water there (what can I say I thought it was amusing at the time).

The sun is wearing sunglasses as is its wont and drinking from a can of Mercury with a straw (mercury being both a liquid and a celestial body). The can of mercury is labelled with both its astrological and chemical symbol, I should note that this was before I heard of the Mercury company I later worked for.

Next to that is the skeleton of Joe Camel who died of lung cancer (I thought that was edgy at the time) and a dog gnawing on one of its bones.

And to complete the plethora of pathetic puns there is the Cacti family, with the mother cactus, father cactus and their son showing off his muscles.

Slicing up a UTF-8 string

30 03 2015

A couple of years ago I had to deal with some low level code that sent a UTF-8 encoded string as packets of bytes. At first I converted each packet to a string and stored a concatenation of the results, but I got a defect saying that we would sometimes get funny strings that contained a � character. I recognized the Unicode replacement character and quickly figured out that the cause was that a multi-byte UTF-8 character was split between two packets and thus could not be correctly converted to a string. The solution was simple: just accumulate the data as bytes and only convert to string when all the data has been received.

This memory surfaced when I performed a code review for a colleague who was facing a 1 MiB size limitation when using Chrome’s Native Messaging. His solution was to cut the message into chunks and send them one after the other.

I warned him about the danger of arbitrarily splitting a UTF-8 string without checking if you’re at a character boundary.

As mentioned in Wikipedia’s entry for UTF-8, one of the main advantages of UTF-8 is that it is backwards compatible with ASCII; this means that all ASCII characters have the same meaning in UTF-8. Since ASCII uses 7 bits, every ASCII character has a 0 MSB, so in UTF-8 a 0 MSB denotes a single-byte character. The first byte of every multi-byte character begins with as many 1 bits as there are bytes in the character, followed by a 0 (e.g. a three-byte character will start with 1110). All the other bytes in the character (known as continuation bytes) begin with 10.

Here’s a summary table:

| First bit(s) | Condition | It is a | Rule |
|---|---|---|---|
| 0 | (byte & 0x80) == 0 | Single-byte character | It’s OK to cut before or after it |
| 10 | (byte & 0xC0) == 0x80 | Continuation byte | Do not cut before or after it |
| 11 | (byte & 0xC0) == 0xC0 | First byte of multi-byte character | It’s OK to cut before it but not after it |
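
The rules above could be applied along the following lines. This is only a sketch, not the code from the review: safeCutIndex and splitUtf8 are hypothetical names, and it assumes the input is valid UTF-8 held in a Uint8Array. A cut at index i is treated as safe when the byte at i is not a continuation byte, so the search backs up from the size limit until it reaches such a byte.

```javascript
// Returns the largest index <= limit at which `bytes` can be cut
// without splitting a multi-byte UTF-8 character.
function safeCutIndex(bytes, limit) {
    if (limit >= bytes.length) return bytes.length;
    let i = limit;
    // Back up while the byte after the cut is a continuation byte (10xxxxxx).
    while (i > 0 && (bytes[i] & 0xC0) === 0x80) i--;
    return i;
}

// Splits `bytes` into chunks of at most maxChunkSize bytes,
// cutting only at character boundaries.
function splitUtf8(bytes, maxChunkSize) {
    const chunks = [];
    let start = 0;
    while (start < bytes.length) {
        const end = safeCutIndex(bytes, start + maxChunkSize);
        if (end <= start) {
            throw new Error('maxChunkSize is smaller than a single character');
        }
        chunks.push(bytes.subarray(start, end));
        start = end;
    }
    return chunks;
}

const bytes = new TextEncoder().encode('Bugs Я Us'); // 10 bytes, 'Я' takes two
const chunks = splitUtf8(bytes, 6).map(chunk => new TextDecoder().decode(chunk));
console.log(chunks); // [ 'Bugs ', 'Я Us' ]
```

A naive cut after six bytes would land between the two bytes of 'Я'; backing up by one byte keeps the character whole, at the cost of a slightly shorter first chunk.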

OCD is the path to the dark side

6 01 2015

A while back I had to wrap a built-in JavaScript function. This is pretty simple thanks to the fact that JavaScript is a dynamic, prototype-based language. Here’s an example of how this can be done (not the actual function or functionality in question):

(function wrapAddEventListener() {
  var orig = HTMLElement.prototype.addEventListener;
  function wrapper(name, handler, capture) {
    console.log("Added a handler for " + name + ' on ' + this);
    // register a handler that logs the event and then calls the real one
    return orig.call(this, name, function(ev) {
      console.log("Got Event " + ev.type);
      return handler.apply(this, arguments);
    }, capture);
  }

  HTMLElement.prototype.addEventListener = wrapper;
})();

The problem was that then my OCD kicked in because now if I type document.body.addEventListener in the console I get the function’s body instead of function addEventListener() { [native code] }. For some reason this bothered me (why?) enough to add the following line to the function wrapping code:

wrapper.toString = function() {
    return orig.toString();
};

Now this is deceitful and worthless since it doesn’t really achieve anything; debugging into the function will still show the wrapper code. Still I felt that for aesthetic reasons this is preferable.
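
For a version that can be tried outside a browser, here is a minimal sketch of the same trick on a plain object. wrapMethod, onCall and greeter are hypothetical names invented for the example; the object simply stands in for HTMLElement.prototype.

```javascript
// Replaces target[name] with a wrapper that reports each call and then
// delegates to the original. wrapMethod and onCall are hypothetical names.
function wrapMethod(target, name, onCall) {
    const orig = target[name];
    function wrapper(...args) {
        onCall(name, args);
        return orig.apply(this, args);
    }
    // Cosmetic only: make the wrapper print like the original.
    wrapper.toString = () => orig.toString();
    target[name] = wrapper;
}

const greeter = { greet(who) { return 'hello ' + who; } };
const before = greeter.greet.toString();
const calls = [];
wrapMethod(greeter, 'greet', (name, args) => calls.push(name + ':' + args.join(',')));

console.log(greeter.greet('world'));              // hello world
console.log(greeter.greet.toString() === before); // true
```

The disguise is skin deep: Function.prototype.toString.call(greeter.greet) bypasses the overridden property and still returns the wrapper’s source.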

I’m not sure if covering your tracks like this is evil (since it’s deceitful) or acceptable since it isn’t hiding any semantic changes. I’ll just hope it’s the worst of my sins for the upcoming year…

Converting Unicode to Unicode

11 11 2014

Recently my matchmaker called me over for a consultation. He was facing some trouble with text encoding and since I once read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) I’m considered an expert (rather than barely competent, which is also an overstatement).

From the get go it was obvious that the problem was in converting UTF-8 strings to UTF-16. Two main methods were used for this: the CW2A classes and CComBSTR’s constructor that accepts a const char*. These methods both use the CP_THREAD_ACP code page when converting strings and you cannot set the thread locale to be UTF-8.

After introducing a fix we inspected the results in the debugger and were confused by what we saw in the watch window. We therefore decided to have a look at a toy example.

Analyzing the problem

Consider the string “Bugs Я Us” which contains the Russian letter “Я” (ya).

#include <atlbase.h>
#include <atlconv.h>
#include <string>
using std::string;

int main(int argc, char* argv[])
{
	const wchar_t * wide = L"Bugs Я Us";
	CW2A cw2a(wide);
	CW2A cw2a8(wide, CP_UTF8);
	string str = CW2A(wide);
	string str8 = CW2A(wide, CP_UTF8);
	CComBSTR bs(str8.c_str());
	CComBSTR bs8(CA2W(str8.c_str(), CP_UTF8));
}

Our toy example gave almost the expected results:

| Type | Default | CP_UTF8 |
|---|---|---|
| CW2A | Bugs ? Us | Bugs Я Us |
| std::string | Bugs ? Us | Bugs Я Us |
| CComBSTR | Bugs Я Us | Bugs Я Us |

The things that surprised me were the cells in red, those should have the correct string surely?

Then I remembered the s8 format specifier, which instructs Visual Studio to display strings as UTF-8; perhaps the strings are correct but Visual Studio is misleading us! After adding s8 to the watch window things looked marginally better. Now only the std::string differs from my expectations.

| Type | Default | CP_UTF8 |
|---|---|---|
| CW2A | Bugs ? Us | Bugs Я Us |
| std::string | Bugs ? Us | Bugs Я Us |
| CComBSTR | Bugs Я Us | Bugs Я Us |

A bit more poking around showed that the reason for this is that std::string’s visualizer uses the s specifier.

You can find the visualizer in:
<VS Install Directory>\Common7\Packages\Debugger\Visualizers\stl.natvis

I added the 8s (turning each s specifier into s8) to the file (you have to do this as administrator).

<Type Name="std::basic_string&lt;char,*&gt;">
  <DisplayString Condition="_Myres &lt; _BUF_SIZE">{_Bx._Buf,s8}</DisplayString>
  <DisplayString Condition="_Myres &gt;= _BUF_SIZE">{_Bx._Ptr,s8}</DisplayString>
  <StringView Condition="_Myres &lt; _BUF_SIZE">_Bx._Buf,s8</StringView>
  <StringView Condition="_Myres &gt;= _BUF_SIZE">_Bx._Ptr,s8</StringView>
</Type>

Now, std::string, at least, defaults to UTF-8 representation in the debugger visualizer


You may be asking yourself why there are two lines each for DisplayString and StringView. This is due to the fact that Visual C++’s string uses the Short String Optimization, which avoids dynamic allocations for short strings.

I personally think that Visual Studio should allow configuring the default encoding it uses to display strings, much as it allows displaying numbers in hexadecimal format.


Detecting Additional Offenders

After fixing the original bug we tried to find other locations that may be harbouring similar bugs.

Finding all instances of CW2A is easy, just grep for it, but finding places that use a specific overload of CComBSTR’s constructor or assignment operator is more of a problem.

One way to do this is to mark the offending methods as deprecated. Using #pragma deprecated would allow us to deprecate a method without editing VC’s headers but since we want to deprecate a specific overload this is not an option. I had to use my administrator rights again to edit atlcomcli.h.


Now we get a warning for every use of the deprecated method and can decide whether we’ve found a lurking bug.