More Big Software Performance Speedups

As I sit here waiting for Matlab to compute in over 25 minutes what I can now compute in 10 milliseconds in mostly-unoptimized C++ code running in debug mode, I figured I should write a blog post stressing the importance of a few small differences that can result in >150,000x speedups like this.

Note that I’m not putting down Matlab, since it’s not designed to run fast; I’m just using this as an example of why it’s important to not get complacent with performance.

1. Precomputation

This may be the largest performance difference between the two implementations, though it might be a tight race between this and the next two differences.  Precomputation, however, is very widely applicable, easy to do, and could have been applied to the Matlab code to some extent.

The basic idea is: instead of computing the same data repeatedly, compute them once and save them.

In those words, it sounds so obvious, but it encompasses a broad range of things, some of which are less obvious.  For example, in this case, I precomputed the data before running the program and read them in from input files.  I'll be running the program on the same data possibly thousands of times, and those data took longer to compute than to read in, so there's no sense in recomputing them each time.  In particular, I've computed about 1,000 points on each of a bunch of curves, and the program linearly interpolates between them, which gives plenty of accuracy in this case.

There are of course cases where precomputation doesn't make sense, such as when the computation is as fast as looking up the data, or when the amount of data to save is unreasonably large, but in a very large range of applications, precomputation can speed things up a ton.

2. Native Code

People will yell until they're blue in the face that interpreted code or Just-In-Time-compiled code is usually as fast as native code (i.e. code compiled to machine code ahead of time).  They're dead wrong, and usually somewhere up the chain, they got their information from a marketing department or a study funded by Sun, IBM, etc., even though they're probably not aware of it.  Nonetheless, I was honestly shocked that such extremely simple tests confirm this, e.g. running the following code in Java versus in C++.

int a = 12468;
for (int i = 0; i < 2000000000; ++i) {
    a /= 2;
    a *= 2;
}
// Output a

C++ compilers trivially replace the multiplication by 2 with an addition, and the division by 2 with a bit shift (plus a small correction, because the compiler doesn't clue in that the value is always positive).  Of course, a really smart compiler would realize that the entire loop is useless, since the starting number is positive and even, but none of them appear to be that smart.  Java, however, will not even do the simple replacement, no matter how many times the loop is run.  This is visible by comparing against the performance of:

int a = 12468;
for (int i = 0; i < 2000000000; ++i) {
    a >>= 1;
    a += a;
}
// Output a

In Java, this new code is several times faster than the previous, whereas in C++, the performance is (almost) identical between the two, since the compiler has already made that replacement.

In my case today, I'm not comparing against Java; I'm comparing against Matlab, which not only doesn't compile its code ahead of time, but literally interprets the text of the code as it runs, the way classic BASIC interpreters did.  That adds factors of thousands just for figuring out what the program is doing.  It just doesn't stand a chance outside of its prebuilt libraries.  Inside its prebuilt libraries, e.g. matrix multiplication or diagonalization code written in C++, it's fast, so one must try to use them where reasonable.

3. Customization

As much as Matlab provides some great functionality that can be a pain in the butt to implement from scratch, such as diagonalization of big matrices, it (understandably) doesn’t let you specify everything you know that may help speed up the computation.

For example, matrix-matrix multiplication where you input two matrices and get back their product is very slow because of memory caching issues.  However, if the matrix on the right of the multiplication is already transposed ahead of time, it's much faster.  (I'll save the explanation of why for another time.)  Matlab doesn't know whether there's a way to rework your code such that some matrices are pre-transposed and others aren't; it just receives two matrices and has to spit out the product.  If you can use your knowledge of your code to do this, you can get a big performance boost.

Another example in this case: if you have good estimates of the eigenvalues and eigenvectors of your matrix, finding them exactly is much faster, but Matlab's diagonalization just starts from random guesses.


As I mentioned at the top, the C++ implementation is mostly unoptimized, so I'm still expecting more performance improvements as I scale it up and optimize it (speeding up a 10 ms computation isn't all that useful on its own).  I'll try to remember to report again when it's multi-threaded and vectorized.  Nonetheless, simple changes like those above can make a huge difference in the performance of software.

~ by Neil Dickson on November 7, 2010.

5 Responses to “More Big Software Performance Speedups”

  1. I always like your tips on code optimization. But it seems to me that compiling tweaks shouldn’t be confused with runtime. The real test would be the C++ program vs the manually optimized java program. Java still loses though 😀

  2. “Java however, will not even do the simple replacement no matter how many times the loop is run.” Prove it.

    I know you’re wrong because I’m looking at the loop strength reduction optimization here, and I know that it’ll run. Most likely, you’re running your micro-benchmark wrong.

    If you just put a loop into your main method, in most cases that can’t be compiled. You can’t just jump out of a loop into a better optimized version of this. Now, this is no longer the case, but not knowing what JVM or JIT you’re running, I can’t say for sure if you’ve got that feature.

    Say it with me, micro-benchmarks don’t test real world performance, they only test a specific aspect of the code, and it might not be the aspect you’re expecting.

  2.   Secondly, Matlab isn't really supposed to be fast; it's supposed to be accurate and more of a utility. Well, that's not entirely true: Matlab is made to be fast with matrices. If you can refactor your computation into some sort of linear algebra problem, it should be much faster. Also, you can do precomputation in Matlab as well, just rewrite your algorithm to use memoization, and if you're really hard up for performance, write it in C and make a DLL call from Matlab.

  3. @Guillaume: You make it sound like even a monstrous pile of evidence will never convince you, so I fail to see the point in running more and more benchmarks until we die of old age.

    As I’ve told you before, I have run “real world” programs where Java is 20x slower than the exact same code in C++. I once wrote a heavily-optimized program in Java for work, later ported it to C++ and was heartily disappointed that it was only 2x faster in C++, so there is some variability. I’m not in the business of faking benchmarks to give the results I want; I’m just observing reality. I have never encountered a program that was as fast in Java as it was in C++, and as much as one might be able to construct a program specifically to claim such a result, I would call that practice “academic fraud”, not “real world performance”.

    Quite frankly, if Java can’t do the simplest, most obvious optimizations, it’s absurd to expect it to do any better on more complex code.

  4. @Neil
    I never said that Java would always be faster than C++ or Assembly, I know it won’t (with the convoluted exceptions) because I’m working every day to make these things faster, so I’ll just dismiss your first point there as a strawman argument.

    What I'm saying is that you're likely not running your microbenchmark (whose usefulness is doubtful) properly. I'm not saying you're faking the results, just that your methodology is flawed. Try this instead: put your loop inside a function, and call that function from main a few times, noting the speed of each call. Notice that the speed increases?

    To address your last point, yes, Java can do those optimizations, and much much more. It's not absurd to expect it to do better on more complex code at all. With more function calls, there's more to inline (either fully or partially), there's more to pull out of loops, and there's more information for better aliasing analysis. Generally, there's just more to work with, and more opportunity for improvement. Why do you think compilers are compared with huge SPEC benchmarks with a lot of data, and not just a single tight loop?

  5. You’re missing the point of that code.

    There’s no reason for Java to choose to do a bad job of encoding *2^n and /2^n, then change its mind and do a good job. They’re both extremely common and should be encoded right the first time instead of spending exactly the same amount of time encoding them poorly, wasting seconds of computation time due to the poor encodings, then wasting more time to re-encode them. These sorts of tiny optimizations with huge impact can even be done with text replacement at compile time instead of at run time. That suggests that someone up the chain either intentionally chose to ignore the obvious, had absolutely no clue, or wanted Java to have poor performance.
