Importance of Explicit Vectorization for CPU and GPU Software Performance
The preprint version of my first paper as a lead author is now out here. It’s been submitted to Parallel Computing for peer review. The paper walks through several optimizations I did on the AQUA@Home Quantum Monte Carlo simulation code, independent of multi-threading. The journal encourages authors to include multimedia too, so I’m looking into making brief tutorial videos (<5 minutes each) describing the optimizations.
The upshot of the paper is mostly captured by Figure 13:
This compares an Intel Core i7-965 CPU and an NVIDIA GTX-285 GPU, both of which are currently close to the highest-performance single processing units available from their respective companies. For a more detailed look at the CPU versions, including two with compiler optimization off (A.1a and A.2a), see Figure 15:
The main conclusions are: 1) it is sometimes possible to achieve large speedups from explicitly optimizing code on both CPU and GPU, 2) it is sometimes possible for a CPU to outperform a GPU when both are running optimized code.
CPU vs. GPU
I’ll start with the 2nd one, since it’s probably more controversial, and hence the most likely to result in angry emails or blog posts attacking my credibility. How on earth can a high-end CPU with 8 logical cores outperform a high-end GPU with 240 cores by a factor of 2.04, when data transfer time is negligible? It means that each CPU core would have to perform comparably to 61.2 GPU cores in this application; a huge discrepancy to account for. The breakdown below isn’t rigorous, but it goes through the primary factors involved.
Factor #1: Clock Speed. The cores on the Intel Core i7 I used in the paper run at 3.2 GHz (I turned off overclocking, since it ironically made the graphics card flake out). The cores on the GTX-285 run at 648 MHz (faster than the GTX-295, in case you’re wondering). That’s a factor of 4.94, leaving a discrepancy of 12.39x.
Factor #2: Vector vs. Scalar. Each core on the CPU spends the bulk of its time running instructions that do 4 operations at once on a vector, whereas each GPU instruction does only 1 operation. That factor of 4 leaves a discrepancy of 3.10x.
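The accounting so far can be checked with a few lines of arithmetic (these are just the numbers quoted above, nothing new):

```python
# Core-per-core accounting from the numbers above: the CPU wins by 2.04x
# overall, so each of its 8 cores matches 2.04 * 240 / 8 = 61.2 GPU cores.
cpu_cores, gpu_cores, overall_speedup = 8, 240, 2.04
per_core_ratio = overall_speedup * gpu_cores / cpu_cores
print(f"{per_core_ratio:.1f}x")                 # 61.2x

# Factor #1: clock speed, 3.2 GHz vs. 648 MHz.
clock_factor = 3.2e9 / 648e6
remaining = per_core_ratio / clock_factor
print(f"{clock_factor:.2f}x {remaining:.2f}x")  # 4.94x 12.39x

# Factor #2: 4-wide vector instructions vs. scalar GPU instructions.
remaining /= 4
print(f"{remaining:.2f}x")                      # 3.10x
```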
Factor #3: Branching. In the scalar version of the code, there’s a key branch that chooses whether or not to flip the sign of a variable and update data for related variables, and it’s taken 28.6% of the time on average. However, when operating on 4 variables at once, if any one of them takes the branch, all 4 must wait, making the average probability 56.8% (computed from 115 separate probabilities, shown in Figure 14, which is why it’s not simply 1 − (1 − 28.6%)^4 ≈ 74.0%). For the GPU, where groups of 32 threads (called warps) share an instruction pointer, they must wait 82.8% of the time. This would suggest another factor of 1.46 (i.e. 82.8%/56.8%), but not all of the time is spent on that branch, so without knowing the breakdown of which parts took how long, one can’t be sure how much it contributes.
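For the curious, the uniform-probability version of this branch arithmetic is easy to sketch. It’s only an approximation: the real 56.8% and 82.8% figures come from 115 different per-variable probabilities (Figure 14), not a single uniform probability, so the uniform formula overstates them.

```python
# If each of k lanes independently takes a branch with probability p, the
# whole group has to execute the branch body whenever any lane takes it:
#   P(group branches) = 1 - (1 - p)^k
def group_branch_probability(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

p = 0.286  # average branch probability from the text
print(f"{group_branch_probability(p, 1):.1%}")   # 28.6%  (scalar)
print(f"{group_branch_probability(p, 4):.1%}")   # 74.0%  (4-wide SSE)
print(f"{group_branch_probability(p, 32):.1%}")  # 100.0% (32-thread warp)
```

Even with a single uniform probability, the trend is the same as in the paper: the wider the group sharing one instruction pointer, the more often everyone waits.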
Factor #4: Clocks Per Instruction. This is actually many factors, but they’re difficult to separate. The L3 cache on the Core i7 is 8MB, enough to hold some of the key data structures. On the GTX-285, the analogous “shared memory” on each multiprocessor is only 16KB (and manually controlled, unlike a cache), meaning that the key data structures must reside in the “device memory”, which is considerably slower; it’s analogous to main memory from the CPU’s perspective. Also, each core of the Core i7 has multiple execution units, so it can run multiple operations simultaneously, giving a maximum throughput of 3 vector bit-manipulation instructions in a single clock cycle. I couldn’t find reliable clock-cycle numbers for the GTX-285, but each of its cores has only one execution unit.
Although factors #3 and #4 are not rigorous, they could reasonably account for the remaining 3.10x discrepancy. Note that factors #3 and #4 are highly application-dependent, so it would be unwise to claim that this 61.2x core-per-core performance difference applies to more than this application, but one can at least account for a 19.75x difference, from factors #1 and #2, assuming the CPU code is vectorized.
To be absolutely clear: I’m not stating that a CPU is generally faster than a GPU; I’m stating that a GPU is not always faster than a CPU, and that comparing numbers of cores isn’t a good metric.
Explicit Performance Optimization
Many people I’ve talked to seem to think of performance optimization done explicitly, i.e. in addition to implicit compiler optimization, as “beating the compiler” or “performance tuning”, which I find quite frustrating. You don’t get a 10x speedup by doing the same thing as the compiler, only better; I’d go so far as to conjecture that compilers are good enough that you can’t get such a speedup that way. The important thing to realize is that you have key knowledge that the compiler doesn’t have, such as “this code is just generating random numbers and can be replaced with faster code that would generate different random numbers” or “it’s okay (and possible) to rearrange the order of these nodes so that groups of 4 adjacent nodes can be updated independently.”
As shown by the difference between A.1a and A.1b in the bar graph above, compiler optimizations gave a 1.51x speedup for the original code. That’s certainly nothing to sneeze at when the simulations take months to run on over 2,000 computers, but by using knowledge about the problem and algorithm at hand, as well as about the CPU’s available computing resources, it was possible to get a further 11.86x speedup on top of that. In implementations A.2a and A.2b, I wrote the random number generation code in a way that made it easier for the compiler to vectorize, and compiler optimizations then gave a 2.94x speedup instead of 1.51x, so the compiler can vectorize when it’s given a chance. I’m not “beating the compiler [at a fair fight]”, since the compiler doesn’t have a hope of guessing what I know about the problem, let alone using that knowledge.
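To give a flavour of the kind of restructuring involved (an illustrative sketch, not the paper’s actual generator): running several independent RNG streams in lockstep turns the update loop into identical, dependency-free operations on each lane, which is exactly the shape a compiler or SSE intrinsics can map onto 4-wide vector instructions. Here it is with xorshift32 standing in for the real generator:

```python
# Illustrative only: xorshift32 is NOT the generator from the paper, just a
# simple RNG whose update is pure bitwise arithmetic. Running 4 independent
# streams in lockstep means every step applies the same operations to all 4
# lanes with no cross-lane dependency -- the shape a compiler (or SSE
# intrinsics) can turn into 4-wide vector instructions.
MASK = 0xFFFFFFFF  # keep values in 32 bits, as the hardware would

def xorshift32_step(states):
    """Advance each RNG lane one step; identical ops on every lane."""
    out = []
    for s in states:
        s ^= (s << 13) & MASK
        s ^= s >> 17
        s ^= (s << 5) & MASK
        out.append(s)
    return out

lanes = [1, 2, 3, 4]  # 4 independent seeds, one per SIMD lane
for _ in range(10):
    lanes = xorshift32_step(lanes)
print(lanes)
```

The streams produce different numbers than a single sequential generator would, which is precisely the kind of change a compiler can never make on its own; only the programmer knows it’s acceptable.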
This is one of the reasons why I find the term “optimizing compiler” rather outrageous. It suggests to developers that there’s no point in optimizing, since the compiler’s already done it for them. The claims made about compilers that automagically parallelize code are even more outrageous.
If you need your code to run much faster and can’t count on that many times more cores or computers, you may not be out of luck.