The code/data duality

6 01 2016

One often wants special code to run the first time a function is invoked. In C++ the naïve way to do this is using function-static variables.

Consider a function that returns a random number. For nostalgia’s sake we want to use rand  (which needs to be seeded via srand before the first use). We may do something like this:

#include <cstdlib>

int random(int max) {

       static bool initalized = false;

       if (!initialized) {

              srand(time(nullptr)); // seed the random generator once

              initialzed = true;

       }

       return rand() % (max+1);

}

When I was in university, towards the end of the previous millennium, we had to learn VAX assembly (not the most useful skill I’ve ever picked up). That was when I was introduced to the concept of self modifying code. The crux of the matter is pretty obvious in hindsight, code and data aren’t different things, in the end everything is software is bits and bytes. This means that we can modify the code of the program as simply as modifying variables.

I don’t remember my assembly and I’m assuming that anyone reading this either:

  1. Doesn’t know/remember assembly
  2. Doesn’t need me to explain about code re-writing

So I’ll just invent a simple stack based assembly language since I can’t be arsed to re-learn assembly.

If we take the random function above and translate it to pseudo-assembly we would get something like:

# randomInitialized
0x000DAF00:  0x0  # initialized to zero at compile time
# random
0x000DAF04loadAddress randomInitialized
0x000DAF08branchNotZero randomInitizliedmainFlow
0x000DAF0Cload 0 # first run
0x000DAF10call time
0x000DAF14load 1
0x000DAF18storeAddress randomInitialized
0x000DAF1Ccall srand
# mainFlow
0x000DAF20: increment # max is currently on the stack
0x000DAF24: call rand
0x000DAF28: modulo # rand() % max + 1
0x000DAF2C: return

This is pretty straight forward mapping of the C[++] code to assembly and it has the same drawbacks. We perform a comparison every time the function is run (!initialized) although it almost always has the same outcome. Another deficiency of the code is more pronounced in the assembly version, a lot of code is skipped over for most calls which works against the instruction cache.

What we would really want is that every time the function is called, except for the first time, it will just do what it needs to be done. This can be achieved by modifying the code.

We compile the function to start with a jump (aka goto) to some address (outside the function) which calls srand and then replaces the jump with the first instruction we want for the subsequent function calls. The first instruction we want is  increment, for the sake of argument we’ll say that the opcode for increment is  0xADD1.

# initializeRandom
0x000DAF00load 0
0x000DAF04call time
0x000DAF08call srand
0x000DAF0Cload 0xADD1 # opcode(increment)
0x000DAF10storeAddress random
random
0x000DAF14: jump initializeRandom
0x000DAF18: call rand
0x000DAF1C: modulo # rand() % max + 1
0x000DAF20: return

During the first run the first thing we do is jump out to a memory location that precedes the function proper, then we seed the random number generator and modify the beginning of the function from being a jump to being an increment1. The initial jump is computed in advance so that we store the increment  just after the instruction pointer and then just fall through into the (now) modified function. Subsequent calls to the
function have a lean four instruction function to execute with no conditions and no branches.

random
0x00DAF14: increment # this is now the first opcode
0x00DAF18: call rand
0x00DAF1C: modulo # rand() % max + 1
0x00DAF20: return

There’s no need to waste space on a static variable, the code is more cache friendly and (at least in my made up assembly) smaller.

So if everything is so good why isn’t this used in practice? OK so high level languages don’t give you direct access to the code parts of your program but the compiler could generate such code, right?

Well I understand next to nothing about compilers but I’m pretty sure that multiple cache levels at least will make things unpractical (not to mention branch prediction).

The most obvious problem with this example is that it’s horrendously thread unsafe.  I suppose that some of the readers have been tearing out their hair from the get go, the original C++ function with the static variable was just as unsafe. An unprotected shared variable could be modified different threads simultaneously which is a data race and undefined behavior (starting with C++11).

As an aside I would like to mention that C++11 introduced “thread-safe function local static initialization”  so a better implementation of random would be:

#include <cstdlib>
int random(int max) {        
    static bool unused = ([]{// define and invoke a lambda that
        srand(time(nullptr)); // seeds the random generator
        return false;
    })();

    return rand() % (max+1);
}

Here I’m depending on the fact that a it’s the compiler’s responsibility to initialize a static variable is only once in a thread safe way. The static variable here is a bool but it’s never really used, all we need is the side effect when creating it (does anyone what to submit a proposal to allow static void variables?).

In most mainstream compiled languages, the code doesn’t have access to the generated machine code. Thus most programmers nowadays have a mental divide between code and data (Lisp programmers, feel free to gloat now). However JavaScript, as a scripting language  with a functional orientation, brings the code/data duality back together again.
Since functions (code) are objects, self mutating code is back in business. Luck would have it that JavaScript is single threaded (mostly) which allows code to mutate in a thread-safe way.

Consider, for example, a wrapper around a WebSocket.
After creating a web-socket it’s not usable until the connection is established. Due to the single threaded nature of JavaScript this means that you have to relinquish control of the thread before using the object. Say we want to store all outgoing messages until the socket is opened and then send them, one way to achieve this would be like this:

function Socket(address) {
    this.socket = new WebSocket(address);
    
    var queue = []; // captured by 'send' and 'onopen'
    this.send = function (message) {
        if (this.socket.readyState !== WebSocket.OPEN)
            queue.push(message);
        else
            this.socket.send(message);
    }
 
    this.socket.onopen = function () {
        // send queued messages
        queue.forEach(msg => this.send(msg));
    };
}

Now let’s see how the same thing could be achieved with self modifying
code:

function Socket(address) {
    this.socket = new WebSocket(address);
 
    var queue = []; // captured by 'send' and 'onopen'
    this.send = message => queue.push(message);
 
    var self = this;
    this.socket.onopen = function () {
        // send queued messages
        queue.forEach(msg => this.send(msg));
        // replace the 'send' message
        self.send = function (message) {
            this.socket.send(message);
        };
    };
}

The send function now has two instances, before the socket is fully open
and after it is open. But wait, what about after the socket is closed? For some reason sending on a closed socket outputs an error to the console but does not throw an exception. We can modify this behaviour like this:

this.socket.onclose = function () {
    // replace the 'send' function yet again
    self.send = message => {
       throw Error('Sending on closed socket: ' + message);
    }
};

Now we see that code is can be more complex than having two different states, it can be a fully fledged state machine. Admittedly, for most code it is a state machine with only one state, no input to the code modifies the code itself. However it may be useful to keep in mind that the code itself can model the problem space in addition to the data and data structures.


1. In real life the instructions aren’t necessarily the same length but you get the idea
Advertisements




Cactus shirt

17 05 2015

This is the second post in my shirts series, in my first post I told about the hobby I had twenty years ago, drawing on shirts. Since they have started falling to pieces and I can’t make myself throw them out I decided to write about them so that they can live on in digital form and I can reclaim some wardrobe space.

This is one of my first shirts so there are a few unforced errors which I attempted to cover up.

The Front

Cactus-front

On the front I have my Abrahamic character drinking from a decapitated cactus with the caption “Enjoy Cactus Cola, the Sheik’s thing”. This is a play on Coca Cola’s slogan and the similarity between Sheik and Chic. Since this is a cactus the sheik’s hand is obviously bleeding. One of the reasons I’ve stopped wearing this shirt at home is that my children find the blood very disturbing and can’t help but comment about it whenever they see the shirt.

I then got a stain in the centre of the shirt and had to cover it up, I chose a skull and crossbones with the warning

Use of this product may prove hazardous to haemophiliacs

The Back

 Cactus-back

The back of the shirt is a bit of a hodgepodge of desert-based jokes.

I have the character from the front of the shirt crawling towards a mirage. You’ll notice that he has a star and crescent armband, this is to cover up my second error where I initially draw the arm as some kind of Möbius strip.

Next to the mirage is the skeleton of a fish (complete with the skeleton of bubbles coming out of its mouth). The idea was that the fish came to live in the mirage and died since there wasn’t really any water there (what can I say I thought it was amusing at the time).

The sun is wearing sunglasses as is its wont and drinking from a can of Mercury with a straw (mercury being both a liquid and a celestial body). The can of mercury is labelled with both its astrological and chemical symbol, I should note that this was before I heard of the Mercury company I later worked for.

Next to that is the skeleton of Joe Camel who died of lung cancer (I thought that was edgy at the time) and a dog gnawing on one of its bones.

And to complete the plethora of pathetic puns are the Cacti family with the Mother cactus, father cactus and their son showing off his muscles.





Big in Lichtenstein

30 09 2014

I’ve been neglecting this blog recently but I still pop in once in a while to see if anyone is interested. Since there’s been nothing new in months it’s not surprising that the views are pretty consistently in the low double digits.

One thing that did surprise me was to discover that in the last month the fifth country in regards to visits to my blog was Lichtenstein! 28 views, I didn’t even know Lichtenstein had that many residents.

Liechtenstein

Even more surprising is that when searching for Liechtenstein, the country is the third Wikipedia result.





A Turkey by any other name…

17 01 2012

Recently we were discussing how the bird Turkey seems to be named after other countries, in Hebrew it’s an Indian chicken, in Portuguese it’s named after Peru and in Arabic it’s Ethiopian.  The fact that turkeys actually originate in North America didn’t seem to be taken into consideration (except by the Russians).

When I stumbled across the List of names for the Wild Turkey Wikipedia page I decided this called for a map.

Names for Turkeys over the globe

The longest path I found involved 6 countries (5 links):

Bulgaria → Egypt → Greece → France → India → Peru





The problem with child slavery…

16 05 2010

… is that children make lousy slaves.

On Friday I had my six year old put sunscreen on my back and now I’m sunburnt in patches.