The Power of Optimization

People are pretty skeptical when I talk about getting big speedups from simple changes, so when optimizing/debugging the program I made recently to generate tough problem instances of Maximum Independent Set (MIS), I timed how long the generator took on one CPU core of my 1.66GHz Core 2 Duo laptop.

The program starts with a random graph that very likely has an independent set of size m, keeps the first independent set of size m that it finds, and adds edges as it searches so that the first one is the only independent set of size m.  There are many papers out there on hiding independent sets (or cliques) that are much larger than most in graphs, but this is the first I’ve heard of hiding an independent set that’s only one node larger than most of the maximal independent sets. I set out with the objective of getting 128-node graphs with a single maximum independent set of size 24.

The first version that worked correctly printed out every independent set it found, since I’d had some issues with correctly traversing the sets, so printing took a vast majority of the time.  Specifically:

nodes initial edges MIS size build printed parameters passed time (s)
32 64 12 32-bit debug state, graph, and added edges separately ~1200
32 64 12 32-bit debug graph and added edges separately 2.63
32 64 12 32-bit debug added edges separately 0.119
32 64 12 32-bit debug none separately 0.084
32 64 12 32-bit release none separately 0.037

This isn’t even doing any “real” optimizations; it was spending 99.997% of the time just printing out data. Speedup from not printing out anything and using release build: 32345x. In fact, the final result is so much shorter, that much of its time could be overhead, so the speedup on a larger instance would be higher.  I consider this like a design optimization, whereas the optimizations below are algorithmic optimizations.  I find that design optimizations usually have the largest impact and result in simpler code, though in this case the code wasn’t really much simpler since it wasn’t really a full-out design change.  Design changes are almost always of the type “I don’t need to do all this stuff… why am I doing it anyway?”  Moral of the story: printing text is ridiculously slow; avoid unnecessary logging.

Now that it was so much faster, I needed a larger test case, so I started ramping up the graph size:

nodes initial edges MIS size build printed parameters time (s)
64 400 16 32-bit release none separately 4.33
96 640 20 32-bit release none separately ~2316
96 640 20 32-bit release none together ~2598

This was the first “real” optimization I tried, and it didn’t work. The search is highly recursive and has 10 parameters, most of which don’t change, so I was thinking that passing them together in a structure would be better, but it wasn’t. The overhead of accessing them indirectly outweighted the savings from only passing 4 parameters. Moral of the story: even for functions with a ton of parameters, function calling is fast; it’s the content of the function that matters… as long as it isn’t a tiny function.

At this point, I went for a trick I’d done before for traversing independent sets, but I wasn’t sure how it’d work in this case, since it has to add edges too. The trick is whenever the number of nodes left that could be added to the current set is 64 or less, I bit-pack the adjacency matrix for the subgraph induced by these nodes and operate on entire rows of the matrix at once, since each row fits into a qword (unsigned long long). With a 64-bit build, this can be done quite efficiently. It took a lot of debugging to get this to work properly, but it was worth it:

nodes initial edges MIS size build printed parameters other time (s)
96 640 20 64-bit debug added edges separately 64-node trick,
error-checking
54.6
96 640 20 64-bit release none separately 64-node trick 18.4
128 1200 24 64-bit release none separately 64-node trick ~666

That did the trick. It was now possible to accomplish the objective of getting a 128-node graph with a unique MIS of size 24 in just over 11 minutes. Speedup from bit-packing, to do many operations in parallel and reduce overhead: 126x. Moral of the story: simple bit operations take a clock cycle or two; fiddling around with data structures usually takes much longer. This still wasn’t quite where I wanted it, since now that it seems feasible, I would like to ramp it up a bit more. Luckily, I could easily improve the early cut-off check I had (i.e. when there aren’t enough nodes left to create an independent set of size m, you can avoid looking there.)

nodes initial edges MIS size build printed parameters other time (s)
128 1200 24 64-bit release none separately 64-node trick
& early cut-off
134
128 900 32 64-bit release none separately 64-node trick
& early cut-off
811

That helped too. Speedup from early cut-off: 5x. Moral of the story: don’t compute things that you don’t need to, as long as it’s easy to check whether you need to compute them or not.

Final speedup: 625x from just algorithmic optimizations. 20 million x total. To generate 50 graphs of the last type, it’ll only take 11.3 CPU-hours now, but it’d have taken 293 CPU-days without any optimization other than removing printing and using a release build.  I haven’t done any detail-level optimization, what most people call “inner-loop optimization”.  I could do a bunch more, i.e. this is still single-threaded and in C; I could do much better in assembly language and with 8 threads on the Intel Core i7 CPU at work. I’ll probably generate 8 graphs at once by adding a simple OpenMP parallel for loop. However, there’s not much use in generating such graphs with more than 128 nodes at the moment, so I’m back to working on a secret side project I hope you’ll like. 🙂

~ by Neil Dickson on January 4, 2010.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

 
%d bloggers like this: