Importance of Explicit Vectorization for CPU and GPU Software Performance

The preprint version of my first paper as a lead author is now out here.  It’s been submitted to Parallel Computing for peer review.  The paper walks through several optimizations I did on the AQUA@Home Quantum Monte Carlo simulation code, independent of multi-threading.   The journal encourages authors to include multimedia too, so I’m looking into making brief tutorial videos (<5 minutes each) describing the optimizations.

The upshot of the paper is mostly captured by Figure 13: 

[Figure 13: Main performance results from Importance of Explicit Vectorization for CPU and GPU Software Performance]

This compares an Intel Core i7-965 CPU and an NVIDIA GTX-285 GPU, both of which are currently close to the highest-performance single processing units available from their respective companies.  For a more detailed look at the CPU versions, including two with compiler optimization off (A.1a and A.2a), see Figure 15:

[Figure 15: Relative performance of CPU implementations at 6 different levels of optimization]

The main conclusions are: 1) it is sometimes possible to achieve large speedups from explicitly optimizing code on both the CPU and the GPU, and 2) it is sometimes possible for a CPU to outperform a GPU when both are running optimized code.

CPU vs. GPU

I’ll start with the 2nd one, since it’s probably more controversial, and hence the most likely to result in angry emails or blog posts attacking my credibility.  How on earth can a high-end CPU with 8 logical cores outperform a high-end GPU with 240 cores by a factor of 2.04, when data transfer time is negligible?  This means that the performance of each CPU core would have to be comparable to the performance of 61.2 GPU cores in this application; a huge discrepancy to account for.  The breakdown below isn’t rigorous, but it goes through the primary factors involved.
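For reference, the 61.2 figure follows directly from the core counts and the measured speedup:

(240 GPU cores / 8 logical CPU cores) × 2.04 ≈ 61.2 GPU cores per logical CPU core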

Factor #1: Clock Speed.  The cores on the Intel Core i7 I used in the paper run at 3.2 GHz (I turned off overclocking, since it ironically made the graphics card flake out).  The cores on the GTX-285 run at 648 MHz, which is faster than the GTX-295’s, in case you’re wondering.  That’s a factor of 4.94, leaving a discrepancy of 12.39x.

Factor #2: Vector vs. Scalar.  Each core on the CPU spends the bulk of its time running instructions that do 4 operations at once on a vector.  Each GPU instruction only does 1 operation.  That factor of 4 leaves a discrepancy of 3.10x.
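To make Factor #2 concrete, here’s a minimal sketch of the difference (not code from the paper; the XOR update and array names are just placeholders).  The scalar loop does one 32-bit operation per instruction, while the SSE2 version does four at once:

#include <emmintrin.h>  // SSE2 intrinsics

// Scalar: one 32-bit XOR per instruction.
void xorScalar(unsigned* a, const unsigned* b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] ^= b[i];
}

// Vectorized: each _mm_xor_si128 performs four 32-bit XORs at once,
// so the loop issues roughly a quarter as many instructions.
void xorSSE(unsigned* a, const unsigned* b, int n)
{
    for (int i = 0; i + 4 <= n; i += 4)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(a + i), _mm_xor_si128(va, vb));
    }
    // (A real version would also handle the n % 4 leftover elements.)
}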

Factor #3: Branching.  In the scalar version of the code, there’s a key branch choosing whether or not to flip the sign of a variable and update data for related variables, and it’s taken 28.6% of the time on average.  However, when handling 4 variables at once, if any one of them takes the branch, all 4 must wait, making the average probability 56.8% (computed from 115 separate per-variable probabilities, shown in Figure 14, which is why it’s not 1-(1-0.286)^4 = 0.740).  For the GPU, with groups of 32 threads (called warps) sharing an instruction pointer, they must wait 82.8% of the time.  This would be another factor of 1.46 (i.e. 82.8%/56.8%), but not all of the time is spent on that branch, so without knowing the breakdown of which parts took how long, one can’t be sure how much it contributes.
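To illustrate where numbers like 56.8% and 82.8% come from, here’s a small sketch of the calculation (with made-up per-variable probabilities; the real 115 values are in the paper): the probability that a group stalls is the probability that at least one lane in the group takes the branch, averaged over all groups.

#include <cstdio>

// Average probability that at least one of the `width` lanes in a group
// takes the branch, assuming the lanes branch independently.
double groupStallProbability(const double* p, int count, int width)
{
    double sum = 0.0;
    int groups = 0;
    for (int start = 0; start + width <= count; start += width, ++groups)
    {
        double allSkip = 1.0;
        for (int i = 0; i < width; ++i)
            allSkip *= 1.0 - p[start + i];  // probability that no lane branches
        sum += 1.0 - allSkip;               // any lane branching stalls the group
    }
    return sum / groups;
}

int main()
{
    // Hypothetical per-variable branch probabilities (not the paper's 115 values).
    double p[8] = {0.05, 0.60, 0.10, 0.45, 0.30, 0.02, 0.70, 0.07};
    printf("4-wide groups stall %.1f%% of the time\n",
           100.0 * groupStallProbability(p, 8, 4));
    // For a GPU you would use width = 32, one warp per group.
    return 0;
}

Because the per-variable probabilities differ, this average is not the same as plugging the overall 28.6% into 1-(1-p)^4.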

Factor #4: Clocks Per Instruction.  This is actually many factors, but they’re difficult to separate.  The L3 cache on the Core i7 is 8MB, enough to hold some of the key data structures.  On the GTX-285, the similar “shared memory” on each multi-processor is only 16KB (and manually-controlled, unlike cache), meaning that the key data structures must reside in the “device memory”, which is considerably slower; it’s analogous to main memory from the CPU’s perspective.  Also, each core of the Core i7 has multiple execution units, so it can run multiple operations simultaneously, giving a maximum throughput of 3 vector bit-manipulation instructions in a single CPU clock cycle.  I couldn’t find reliable clock cycle numbers for the GTX-285, but each of its cores has only one execution unit.
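As a rough illustration of the execution-unit point (a sketch, not the paper’s code): the three XORs in the first function below are independent of each other, so a Core i7 core can dispatch them to different execution ports in the same clock cycle, whereas the second function forms a dependency chain that can’t be overlapped no matter how many execution units are available.

#include <emmintrin.h>

// Independent operations: these can issue to separate execution ports,
// which is what allows up to 3 vector bit-manipulation instructions per cycle.
void independentXors(__m128i& x, __m128i& y, __m128i& z, __m128i m)
{
    x = _mm_xor_si128(x, m);
    y = _mm_xor_si128(y, m);
    z = _mm_xor_si128(z, m);
}

// Dependent operations: each XOR needs the previous result, so they
// execute one after another regardless of how many units are idle.
__m128i dependentXors(__m128i x, __m128i a, __m128i b, __m128i c)
{
    x = _mm_xor_si128(x, a);
    x = _mm_xor_si128(x, b);
    x = _mm_xor_si128(x, c);
    return x;
}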

Although the estimates for factors #3 and #4 are not rigorous, they could reasonably account for the remaining 3.10x discrepancy.  Note that factors #3 and #4 are highly application-dependent, so it would be unwise to claim that this 61.2x core-per-core performance difference applies beyond this application, but one can at least account for a 19.75x difference from factors #1 and #2 alone, assuming the CPU code is vectorized.

To make it absolutely clear: I’m not stating that a CPU is generally faster than a GPU; I’m stating that a GPU is not always faster than a CPU, and that comparing numbers of cores isn’t a good metric.

Explicit Performance Optimization

Many people I’ve talked to seem to think of performance optimization done explicitly, i.e. in addition to implicit compiler optimization, as “beating the compiler” or “performance tuning”, which I find quite frustrating.  You don’t get a 10x speedup by doing the same thing as the compiler only better; I’d go so far as to conjecture that compilers are good enough that you can’t get such a speedup that way.  The important thing to realize is that you have key knowledge that the compiler doesn’t have, such as “This code is just generating random numbers and can be replaced with faster code that would generate different random numbers.” or “It’s okay (and possible) to rearrange the order of these nodes so that groups of 4 adjacent nodes can be updated independently.”
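As a toy example of that second kind of knowledge (entirely hypothetical code, not the AQUA@Home data structures): if the developer knows the nodes can be reordered so that every consecutive group of 4 never interacts, the inner loop can flip 4 signs with a single vector operation, a transformation no compiler could infer from the original node ordering.

#include <emmintrin.h>

// Hypothetical layout: the sign bits have been reordered offline, using
// knowledge of the problem graph, so that the 4 lanes of each __m128i
// belong to nodes that never interact with each other.
void updateSignGroups(__m128i* signs, const __m128i* flipMasks, int numGroups)
{
    for (int g = 0; g < numGroups; ++g)
    {
        // flipMasks[g] is all-ones in lanes whose sign should flip,
        // and zero in lanes that should stay unchanged.
        signs[g] = _mm_xor_si128(signs[g], flipMasks[g]);
    }
}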

As shown by the difference between A.1a and A.1b in the bar graph above, compiler optimizations gave a 1.51x speedup for the original code.  That’s certainly nothing to sneeze at when the simulations take months to run on over 2,000 computers, but by using knowledge about the problem and algorithm at hand, as well as about the CPU’s available computing resources, it was possible to get the final 11.86x speedup on top of that.  In implementations A.2a and A.2b, I wrote the random number generation code in a way that would be easier for the compiler to vectorize, and it gave a 2.94x speedup from compiler optimizations instead of 1.51x, so the compiler can vectorize when it’s given a chance.  I’m not “beating the compiler [at a fair fight]”, since the compiler doesn’t have a hope of guessing what I know about the problem, let alone using that knowledge.
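Here’s a sketch of what “easier for the compiler to vectorize” can look like (a hypothetical xorshift-style generator, not the actual generator used in the paper): running four independent streams in lockstep, with no dependencies between lanes, gives the compiler an inner loop it can map directly onto 4-wide SSE integer instructions.

// Four independent xorshift32 streams updated in lockstep.  Because each
// lane reads and writes only its own element, an optimizing compiler can
// turn this loop into a handful of 4-wide SSE integer instructions.
// (Hypothetical generator for illustration; not the RNG used in the paper.)
struct Rng4
{
    unsigned state[4];  // seed with 4 distinct nonzero values

    void next(unsigned out[4])
    {
        for (int i = 0; i < 4; ++i)
        {
            unsigned x = state[i];
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            state[i] = x;
            out[i] = x;
        }
    }
};

Note that, as in the quote above, the four streams produce different random numbers than a single scalar stream would; deciding that that’s acceptable is exactly the kind of judgment the compiler can’t make on its own.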

This is one of the reasons why I find the term “optimizing compiler” rather outrageous.  It suggests to developers that there’s no point in optimizing, since the compiler’s already done it for them.  The claims made about compilers that automagically parallelize code are even more outrageous.

Conclusion

If you need your code to run many times faster and can’t count on having that many times more cores or computers, you may not be out of luck.

~ by Neil Dickson on April 2, 2010.

2 Responses to “Importance of Explicit Vectorization for CPU and GPU Software Performance”

  1. Hi Neil,
    My name is Nadav and I am a PhD candidate researching parallel compilers. Congrats on getting your paper published in ParCo. It is an excellent journal.

    In your article you present a very optimistic view of CPUs. I wanted to ask you whether you believe that CPUs would benefit from an extended SIMD vector size.

    • Hi Nadav,
      Thanks for checking out my paper! 🙂 Alas, it has not yet been published in a journal.

      Editors at Parallel Computing refused to let the paper be reviewed, stating that 8 CPU cores wasn’t sufficiently parallel for the journal (even though it’s 32-way parallel including the 4x SIMD), and that the GPU, which was sufficiently parallel for the journal, wasn’t made to look better than the CPU. It seems unscientific to refuse to let research be examined because you don’t like the outcome, but I’ll assume that wasn’t actually why they wouldn’t review it, since that’s still a large extrapolation from what they said.

      I then submitted it to Journal of Parallel and Distributed Computing, where an editor sat on it for 7 weeks without having it reviewed. I’ll assume that he just had trouble finding people willing to review it, even though we sent a list of potential reviewers after 4 weeks, since not many people are familiar with performance optimization. It’s now at least under review, so hopefully I’ll hear back at some point. 🙂

      About extended SIMD vector size, I think it would help up to a point, and that point would vary from application to application. Larrabee (Intel’s not-yet-public GPGPU) apparently has 16x SIMD, which would certainly help with some applications, but might not help with others. I think it should definitely be combined with multi-threading for many parallel applications, which is why they’re going for 32 or more full cores on it (i.e. 32×16 = 512 cores the way most GPU makers count them). My main point in the paper, though, was that ignoring the fact that CPUs have SIMD operations in performance comparisons is like fighting a heavyweight boxer by hitting him with a sledgehammer while he’s doing his laundry.

      In terms of parallel compilers, I’d love to see a compiler that empowers the developer to write parallel code instead of insisting that they write serial-ish code and trying to shoe-horn it into running in parallel. The closest thing I’ve seen is Cilk, which is an interesting first attempt, but needs improvement. I’d love to work on the task myself, but I haven’t the time. 😦
