## Optimizing Collatz for Klutzes

2 06 2010

A while back I posted about solving Project Euler’s problem 14 (aka Collatz conjecture) in JavaScript,  for those just joining us in lieu of a recap I’ll let Randall Munroe explain things: The object of the exercise is to find the number with the longest Collatz orbit under a certain limit, in other words the number which takes the most iterations to get to 1.

I started using this question when interviewing and it got me thinking how fast I could get it to run. This post is the result, basically it’s a study of how much time I can waste optimizing a very small toy problem as well as a blatant violation of the two rules of optimization:

1. Don’t optimize
2. (Experts only) Don’t optimize yet

Along the way I’ll try to find some general lessons we can learn.

Finding the asymptotic complexity of the maximal Collatz orbit is beyond my meagre mathematical/CS abilities, but I take comfort in knowing that (for the moment at least) I’m in good company. The best case for finding the Collatz orbit length for a number `N` is when `N` is a power of two, in this case we halve the number until we get to one, exactly `log2(N)` steps. However I seem to remember from uni that not all numbers are powers of two.  As for the worst case, this is unknown, if Collatz conjecture is wrong then there exists a number for which it takes `∞` steps, this takes us out of the field of complexity and into the halting problem land. Since the problem at hand requires finding the Collatz orbits of all numbers up to a given `N` we must multiply by `N` to get a minimum of `N*log(N).`

Failing to mathematically deduce the complexity leaves us with the empiric way, if we count how many times we call the collatz function we’ll know its complexity this is charted as operations in the following graph and we can see that it falls between `N*log(N)` and `N2 `(which means that the orbit for each given number is ≤ Θ`(N)`). Other series in the chart are how many operations take place when memoizing takes place (as described in my previous post) and the space complexity (cache size) when using said optimization.

## Setting the baseline

The problem with C++ is that it’s too fast, the original problem required us to find the number with the longest Collatz orbit under one million and do this in less than a minute. The most basic solution takes 0.734 seconds, so I did all my measurements on 10,000,000 which took 8.263 seconds for the iterative solution; this number will serve as our baseline.

```int iterative(int limit)
{
std::pair<int, int> max = std::make_pair(0, 0);
for (int i = 2; i <= limit; ++i) {
unsigned long long curr = i;
int len = 1;
for (; curr != 1; ++len)
curr = collatz(curr);
if (len > max.first)
max = std::make_pair(len, i);
}
return max.second;
}```

## Iterative to recursive

The problem with memoizing the values of the numbers we pass through is that we don’t know what the value will be before we run into it, this makes a recursive solution much more natural than the iterative solution. I made the function receive the cache as a template argument, this way the compiler can inline all the function calls for different types of caches.

```template <class T>
int length(unsigned long long n, T& cache)
{
if (n == 1)
return 1;

int ret = cache.get(n);
if (ret)
return ret;

ret = 1 + length(collatz(n), cache);
cache.set(n, ret);
return ret;
}```

Now all we have to do is plug in the caching methods, first we’ll use no caching to see the algorithm is OK.

```struct no_cache {
void set(unsigned long long, int) { }
int get(unsigned long long) const { return 0; }
};```

Unfortunately the recursive method is much slower (all those function calls aren’t free) and we’re now clocking in at 20.83 seconds (252% of our baseline). Let’s see if we can shave this time off.

## Simple cache

The simplest cache to use is `std::map` the bread and butter of any C++ programmer, here’s an adapter class so we can plug in different types of maps.

```template <class T>
class map_cache {
T map;
public:
int get(unsigned long long n) {
typename T::iterator i = map.find(n);
if (i == map.end())
return 0;

return i->second;
}

void set(unsigned long long n, int i) {
map.insert(std::make_pair(n, i));
}
};```

This clocks in at 20 seconds, a small improvement, but wait, why does it feel so slow? Putting in another timer I see that it’s actually taking 25.3 seconds (306%), the extra time (after the solution was found) goes into destroying the map.

## Hash map

`std::map `has logarithmic complexity, a no-brainer would be to use a constant time cache, luckily Visual Studio 9 supplies `std::tr1::unordered_map`, that’s bound to improve things.

Nope, 26.22, things are just going from bad to worse (317%), how can a constant speed algorithm do worse than a logarithmic one? Well it all depends on your constants, it’s easy to forget that as programmers we mostly use bounded numbers; even logarithmic complexity on a 32 bit number is constant if you take the big view. Anyway things aren’t that bad, if we try using `boost::unordered_map` instead of the Visual Studio’s implementation we “only” take 14.57 seconds (176%).

Lesson 1: I wouldn’t go so far to say that MS’s implementation of unordered_map is inferior but you should remember that you depend on the quality of the implementation you’re using and some implementations will work better with some data sets.

## Vectoring in on a solution

One of my interviewees suggested using a vector rather than a map, at least for the first `N` elements. The reasoning is that we know we’ll be accessing all these numbers so there’s no need to use a data structure that supports sparsity, this way we gain the natural advantages of `std::vector` namely low overhead and high data locality (changes in bold blue).

```template <class T>
class vec_map {
protected:
const unsigned long long limit;
std::vector<int> vec;
T map;
public:
explicit vec_map(int limit)
: limit(limit)
, vec(limit)
{
}

int get(unsigned long long n) const {
if (n < limit)
return vec[static_cast<int>(n)];

typename T::const_iterator i = map.find(n);
if (i != map.end())
return i->second;
return 0;
}

void set(unsigned long long n, int i) {
if ( n < limit)
vec[static_cast<int>(n)] = i;
else
map.insert(std::make_pair(n, i));
}
};```

That made quite a difference, now we’re down to 8.32 seconds (less that 101% of the iterative method but still above 100%) what else can we squeeze out?

Lesson 2: “Invisible” considerations such as data locality and cache friendliness can have a very profound affect.

## Being Picky

Our cache grows to be rather big, for the 10M range we have 21,730,848 elements stored, perhaps we can reduce that number. Lets think of when we can encounter a number.

1. When checking this specific number (relevant only for numbers `≤ N`)
2. Reaching the number from above (coming from `2N`)
3. Reaching the number from below (coming from `(N-1)/3)`

So if we only consider the numbers greater than our initial range any value stored in the cache will be used at most one time! That seems rather wasteful, it means that once we use a number we can remove it from the cache.
It also means that some numbers that are cached are never used since they cannot be not reached from below (if the number -1 is not divisible by 3).

```template <class T>
class selective_vec_map {
protected:
const unsigned long long limit;
std::vector<int> vec;
T map;
public:
explicit selective_vec_map(int limit)
: limit(limit)
, vec(limit)
{
}

int get(unsigned long long n) const {
if (n < limit)
return vec[static_cast<int>(n)];

if (((n-1) % 3) != 0)
return 0;

typename T::const_iterator i = map.find(n);
if (i != map.end()) {
int ret = i->second;
map.erase(i);
return ret;
}
return 0;
}

void set(unsigned long long n, int i) {
if ( n < limit)
vec[static_cast<int>(n)] = i;
else
if (((n-1) % 3) != 0)
map.insert(std::make_pair(n, i));
}
};```

Great now we’re down to 6.095 seconds 73.76% of the iterative solution, all the hard work was worth something!</pat self on back> but wait why are we paying for destruction of the cache? We don’t need to reclaim the memory (the process is exiting, the OS will take care of that) and we know that `int` and `unsigned long long` don’t have side effects during destruction.

This goes against a C++ programmer’s training but leaking memory during exit is not a big deal, all we need to do is add a malicious destructor.

```~selective_vec_map() {
T* leak = new T();
map.swap(*leak);
}```

Now we’re down to 4.77 seconds just 57.8 percent of our original time.

Lesson 3: If you know that there are no needed side effects to destroying objects it may improve the user’s experience to forgo on unneeded cleanup (I personally hate it when an application I close takes its time shutting down).

Here’s a graph to recap. ## The other fork in the road

Another obvious optimization (at least it’s obvious in retrospect, I only thought of it after writing most of this post) is to look at only the second half of the numbers in the range. The number with the longest orbit can’t be less than limit/2 since for every number `N ≤ limit`, `2N` is also in our range and `orbit(N*2) = orbit(N) + 1`. When looking at only the second half of the numbers in the range our iterative solution takes 4.1 seconds (just over 50% of the time as one would expect) and there is no detectable change in the caching methods (we have to cache the same number of elements).
All that hard work which got us down to 58% and changing `2` to `limit/2` does better, it’s almost enough to make me cry.

Lesson 4: Don’t get attached to your code, if something better comes along don’t hesitate to throw out the fruit of your toil.

## The Elephant in the room

Some of you may have noticed that this is an embarrassingly parallel problem, if you haven’t noticed this fact it’s time for a paradigm shift. My computer today is dual core (pretty low end I know) and I don’t expect to see a single core computer (baring embedded devices) in the future. By using Microsoft’s Parallel Patterns Library (available in VS10) I can write code that will scale to use all available CPUs on the target machine.

```#include "ppl.h"
int parallel(int limit)
{
auto max = std::make_pair(0,0);
Concurrency::concurrent_vector<decltype(max)> convec;

Concurrency::parallel_for(limit/2, limit+1, // +1 to include limit
[&convec](int i) {
convec.push_back(std::make_pair(i, orbit(i)));
});

std::for_each(convec.begin(), convec.end(),
[&max](decltype(max) curr) {
if (curr.second > max.second)
max = curr;
});

return max.first;
}```

This example uses some of the C++0x features available in VS10 as well as the PPL library.

Using PPL takes about 58% of the time on a dual core machine, add that to the 50% improvement supplied by `limit/2` and we have a winner even on a measly dual core machine.

Running the same program on a 16 core machine only takes 11.9% with no code changes (or recompilation).

Lesson 5: Technology is changing keep up with it.

Edit: I can’t believe I didn’t know about combinable which makes the parallel code even faster (it keeps one value per thread in TLS).

```#include "ppl.h"
int parallel(int limit)
{
typedef std::pair<int, int> pint;
Concurrency::combinable<pint> comb;

Concurrency::parallel_for(limit/2, limit+1,
[&comb](int i) {
auto p = std::make_pair(i, orbit(i));
if (p.second > comb.local().second)
comb.local() = p;
});

auto max = comb.combine([](pint a, pint b) {
return a.second < b.second? b : a;
});
return max.first;
}
```

I guess I haven’t been keeping up with technology 😦

 I’m using `unsigned long long` since `int` isn’t big enough.
 One of the best talks I’ve seen online was Herb Sutter’s Machine Architecture: Things Your Programming Language Never Told You (video, slides) it’s a bit long (almost two hours) but well worth watching! I gained a whole new perspective on how computers work and what to keep in mind while coding.

### 6 responses

18 06 2010

That’s a pretty interesting look at memoization/caching for that problem.

Was the 16-core machines a dual quad core HT or a real 16 cores?

(as a minor note: : unsign long long … why not use uint64 from (or if you have TR1 extensions?)

19 06 2010

Good point regarding the number of cores, after looking more closely I see that it’s a dual quad core HT.

As for uint64 I’m not familiar with it (and my code fails to compile with the changes) is this C++/CLI specific? (If so I’m using native C++)

15 07 2010

Very interesting post. Thanks for writing all this up.

I got here from your related question at StackOverflow, BTW.

I think a traditional hash table would a be a good solution, one that doesn’t use buckets like unordered_map does but just puts the elements (pairs) directly into the hash table. Since the problem never deletes entries, only inserts them, the implementation could be kept simple. You could pre-allocate the space for the entire table using reserve(), and there would be no memory allocations beyond the initial, large one. unordered_map is like list and map in that it stores elements in individual nodes (ie there’s a heap allocation for every element), which is why the memory deallocation takes sooo long in this example.

That’s if you don’t consider it cheating to know in advance how big a table you’ll need 🙂

15 07 2010

I would in fact consider knowing the size of the hash-table in advance a form of cheating…

Do you know of a traditional hash table for C++ that I could try to plug in to this algorithm? Even if it has to resize the table I assume it will do so in amortized constant time so it may not be so bad that we don’t know the size a priori.

As for never removing elements from the cache, I do in fact in one part of the solution (vector + selective hash) remove unneeded elements (it’s the first time the caching solution drops below the iterative one).

10 05 2015

This is actually _not_ an embarrassingly parallel problem. Your parallel solution is incredibly wasteful, since you’re re-calculating the orbit of low-orbit values many times (probably most/all of the low-orbit values). In fact, unless there’s some number-theoretical point I’m missing, it looks like that’s what you spend _most_ of the time in each thread doing.

So, consider filling out a cache of low-orbit values first, then going parallel on the exhaustive search for the maximum orbit. This should still be pretty straightforward to code.

12 05 2015

I meant to say that the problem is “embarrassingly parallel” for the naive solution. As you could see from the data more complex solutions don’t improve the performance significantly.

I think that if you have all the computer’s cores for yourself then using a parallel brute force method is probably the best solution. Then again this blog post was written (for fun) five years ago so any further improvements are left as an exercise to the reader :).