## Approximation Algorithms – Where’s the Theory?

I’ve been thinking a bit recently about approximation algorithms in general and how to compare multiple approximation algorithms for the same problem. Strangely enough, there seems to be almost no (useful) theory on how to compare them, so I started thinking about why that might be.

Exact algorithms are usually deterministic and take a roughly fixed amount of time to solve a particular problem instance. As much as big-O notation is often very misleading, it at least gives a very clear way of comparing exact algorithms that are able to be expressed in that way (which is most). However, most approximate algorithms, in use or proposed, do not fit into that category. This issue also surfaces for exact algorithms where the “average case” performance (what is most important in reality, as opposed to the worst case, which is examined most in theory) is difficult to quantify other than empirically.

It can become very difficult to compare deterministic algorithms to non-deterministic ones, algorithms that can take any chosen amount of time to ones that for a given instance take a fixed amount of time, algorithms with tweaking parameters to ones without, etc. Even within the algorithms where the amount of time is flexible, there are ones where the amount of time (or number of iterations) must effectively be chosen ahead of time (e.g. regular simulated annealing), and ones where if the alloted time didn’t get a good enough answer, it can keep running for longer to get a better answer (e.g. some local search algorithms). There are also algorithms where they are actually classes of algorithms, where one can select, for example *O*(*n*^2) for a faster approximation, *O*(*n*^3) for a more accurate approximation, *O*(*n*^4) for better still, etc. but not a full sliding time scale (e.g. a set of Max. Independent Set algorithms I worked with).

So how do these algorithms get compared? The only way that seems to be used widely at the moment is benchmarks. Benchmarks are specific problem instances that are put together, like a set of performance test cases, usually to cover a few different types of problem instance, often focusing on problem instances that are “hard” for existing algorithms. Conceptually, this makes a lot of sense, it gets “real-world” performance data for the algorithms and whichever perform the best are the best algorithms. This also works for comparing exact algorithms. Of course, this is basically like theorists conceding complete defeat at trying to classify approximation algorithms and just pretending like nobody ever asked them, which wouldn’t be too much of an issue if there weren’t still two issues with respect to using this approach for comparing approximation algorithms.

The first issue is that benchmarks usually can’t take into account both accuracy and time. What happens is that a benchmark instance is either run until a fixed accuracy is reached with the time being what is recorded, or run until a fixed time has elapsed with the accuracy being what is recorded. As such, algorithms that quickly get a very rough accuracy but take a long time to get a high accuracy may have skewed results, ones where the algorithm takes some amount of time before getting any answer may not have valid results, and numerous other such issues come up, most based on the differences I mentioned above. This could partly be addressed by having benchmark instances with finishing criteria designed to cover all types of algorithms, but doing this and analyzing the results are both non-trivial tasks.

The second issue sounds silly but is actually a major problem, and in my mind, it amounts to cheating. I mentioned that benchmark instances are often designed to be “hard” for most current algorithms. Well, what if you just make an algorithm that is designed to do well on the benchmark instances? The answer is that you often end up with an algorithm that doesn’t do well on average but gets great scores on the benchmark. This is like the graphics cards that have custom settings for different benchmarking programs so that they can claim they earned a much higher score than they deserve. Unfortunately, this technique has also all but made a farce of the DIMACS clique benchmark instances. The issue isn’t the benchmark itself, but how it’s been abused by algorithms such as DLS-MC (Dynamic Local Search for Max. Clique) where it even has magic parameters that are tuned to different values through experimentation and then only the fastest time is reported (and performance varies wildly based on these parameters). The problem then becomes knowing what class of instances are important to you, so that you can devise a benchmark that is representative of what you need instead of one that tries to cover all classes. That may or may not be feasible depending on the circumstances, and doesn’t solve the first problem.

The kicker is that supposing that you know the class of problems that you need solved and can generate instances of that type, *and* that there is even a nice fine line of “good enough” accuracy for the first issue, it’s still usually infeasible to clearly define what it means to be “good enough” versus “not good enough”.

So where does that leave us? Well, I’ve got a not-very-good solution, and it’s still based on benchmarks, but it at least gives a mostly fair way of comparing approximation algorithms of the different qualities. It can be summed up as monitoring the accuracy (let’s say, from 0 to 1) over time while running each algorithm. Suppose we have 5 hypothetical algorithms for the same problem.

The above would be the results for one benchmark instance, but this could be generalized to handle results from many benchmark instances. Note that C and D are algorithms that don’t allow a smoothly varying time; effectively, they have no result until that time, and never improve beyond then. Algorithm A does allow smoothly varying time, but will never get beyond 0.8 accuracy, whereas E will get to 1.0 eventually. B has several stages of improvement. Despite these qualitative differences, this works for comparing them.

In the case of the one instance, we mostly care about whatever’s on top (unless a large range of accuracies might be desired in the real scenarios). Each of the above algorithms is “the best” at some time.

This is just a much easier way of reading the top of the chart above. With this, if you want to find the fastest algorithm for getting to a certain accuracy, plot a horizontal line at that accuracy and what it intersects is the fastest algorithm. If you want to find the most accurate algorithm in a fixed amount of time, plot a vertical line at that time, and what it intersects is the most accurate algorithm.

Of course, this gets much more complicated with many more benchmarks; the chart effectively becomes more than 2 dimensions. One could effectively get 3 dimensions with a colour plot as long as the results are monotone increasing, which is often the case, so that can work for when there’s one parameter (e.g. input size) affecting the running time. However, in some cases, these extra dimensions can be squashed back into a 2D chart.

As a simpler task to imagine, think of comparing a non-deterministic algorithm on the 2D chart. Sometimes it will do better or worse at different times, but with many runs, one can establish a distribution of the accuracy at any point in time. Then, instead of plotting just the line for the average, one could plot a range that is most opaque where most accuracy points were and more transparent farther out. Then, it’s still relatively easy to compare the algorithms on the chart, even comparing non-deterministic and deterministic. The same type of distribution display could sometimes be used to show “normalized” runtime results for different benchmark instances, meaning that it would still be fairly easy to see what algorithm is fastest for a desired minimum accuracy. This normalization is really just scaling until the curves match as much as possible. One could try “normalizing” accuracy instead of runtime to find what algorithm has best accuracy for a fixed amount of time, but I suspect that the results of that normalization would be fairly skewed.

As I said, it’s not a very good solution, and there’s still not much theory there, but it’s at least a relatively fair way of comparing approximation algorithms. It is sometimes possible to establish that the runtime is, for example, *O*(*a*^2 *n*^3), where *a* is the accuracy as a parameter in the range 1 to infinity (i.e. 1/(1-*x*), where *x* is accuracy from 0 to 1), in which case some theory could begin to be used, but it gets complicated to compare algorithms with more than one parameter in the big-O.

Anyone else have any ideas about comparing approximation algorithms?