Multi-threading is Simple
A friend asked me to do a post on distributed/parallel computing. Covering everything would be a bit much for one post, so I’d first like to address one of my biggest beefs related to multi-threading.
Specifically, many people have this bizarre assumption that multi-threading is difficult, perpetuated for no apparent reason by a few people who are evidently either too stubborn or too stupid to think and program at the same time (e.g. Joel Spolsky, the makers of SQLite). I hope to show in this post that there’s very little excuse for not multi-threading, since you can do a half-decent job of it with one remarkably simple, general structure and almost no effort.
“Threads are evil. Avoid them.” –The SQLite FAQ
Multi-threading at its core is simple in concept and practice; people just never get a good explanation of how simple it can be, so understandably, they’ll tend to be mystified by it. Perhaps the simplest way to view parallelism is on a dependency graph. The following is an example of a dependency graph for a set of tasks to be done when planning a wedding:
You can’t print invitations (task E) before you know the date (task C); otherwise the invitations might have the wrong date on them. In other words, E depends on C. Tasks that depend on each other can’t be run at the same time, because one must be done before the other. However, tasks A, D, and G (in red) do not depend on each other (they are independent), so those 3 tasks could be done by different people at the same time, possibly saving time overall. The following is one way that these tasks could be parallelized with 3 people/threads:
At all points in time, all tasks running at the same time must be independent. Notice that when task C finishes, there is nothing ready to be run yet, so that thread can’t do anything until B finishes, when both G and D can be started. This happens to be an optimal parallelization of these tasks, but supposing we only had 2 threads, here are 2 different parallelizations, one of which is not optimal:
An interesting feature of these two parallelizations is that both of them are locally optimal, in that no single task can be moved to improve the total time in either case.
All this is well and good, but how exactly would we implement something like this?
This is also simple using a workpool. A workpool is effectively any producer-consumer-like structure that represents tasks to be run. One thread is created for each CPU core, and they each look for work in the pool. This can sometimes be as simple as a single integer that is the index of the next task to be done in an array of completely independent tasks. A workpool can even be completely implicit, such as when everything a thread is to do can be determined from its thread number (e.g. random sampling with each thread having a different seed, or each thread traversing a separate portion of an array). Any thread-level parallelism can be expressed in terms of just a workpool, though it is often much more efficient to use other control structures instead of, or in addition to, one.
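To make this concrete, here is a minimal sketch of the simplest case just described: an array of completely independent tasks plus a single shared counter holding the index of the next task to claim, with one thread per core. It’s written in modern C++ for brevity, and the function name and task representation are illustrative assumptions, not code from any particular project.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// A minimal workpool over completely independent tasks: the "pool" is just
// an array of tasks plus one shared counter holding the index of the next
// task to be claimed.
void runAll(std::vector<std::function<void()>>& tasks)
{
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        for (;;) {
            std::size_t i = next.fetch_add(1);  // atomically claim the next task
            if (i >= tasks.size())
                return;                         // nothing left; this thread is done
            tasks[i]();
        }
    };

    // One worker thread per logical CPU core, each pulling work from the pool.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0)
        n = 1;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
}
```

Each thread simply grabs the next unclaimed index until the array runs out, so the tasks get spread across the cores automatically, whichever threads happen to be free.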
The example above is actually one of the more complicated cases: each task has a list of the tasks that depend on it (e.g. task G would have a list pointing to E, H, and I), and each task keeps a counter of how many dependencies it is still waiting for. When a task finishes, it decrements that counter on each of its dependent tasks. If a task’s counter reaches zero, it can be added to the workpool. Then, if the workpool isn’t empty, some task from it is chosen to run in the current thread.
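A sketch of that dependency-tracking bookkeeping might look something like the following. The struct layout, the mutex-protected ready pool, and the function names are assumptions for illustration, not code taken from AQUA@Home; and, as discussed below, the counter decrement has to be atomic.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// One node in the dependency graph.
struct Task {
    std::vector<Task*> dependents;  // tasks that depend on this one (e.g. G lists E, H, I)
    std::atomic<int>   waitingOn;   // how many dependencies are still unfinished
    // ... plus whatever data the task's actual work needs ...
};

std::vector<Task*> readyPool;       // tasks whose dependencies are all done
std::mutex         poolMutex;       // protects readyPool

// Called by a worker thread right after it finishes running `done`.
void onTaskFinished(Task* done)
{
    for (Task* dep : done->dependents) {
        // Atomically decrement the dependent's counter; if this was the last
        // dependency it was waiting on, it becomes ready to run.
        if (dep->waitingOn.fetch_sub(1) == 1) {
            std::lock_guard<std::mutex> lock(poolMutex);
            readyPool.push_back(dep);
        }
    }
}
```

Initializing each task’s counter to its number of dependencies and seeding the pool with the tasks whose counters start at zero gives exactly the general workpool algorithm described above.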
All of the above parallelizations can be obtained by that general workpool algorithm.
There are even some impressive properties that have been proven about workpools. For example, if all of the tasks take the same time (in practice, if variation in task length is very small compared to the total time), any way of selecting tasks from the workpool will be at most 2x slower than the best, no matter what the dependencies are, and on average should do much better than that. If there are few dependencies and a very large number of tasks, it becomes easy to obtain near-perfect speedup on many CPU cores.
There is a small caveat, for those paying close attention, and it comes from unintended dependencies. For example, something as simple as decrementing a task’s counter of remaining dependencies creates a dependency itself. What if two threads try to decrement the counter at the same time? Both read the value 5, subtract 1, and store the value 4, but if they’d been just a few clock cycles apart in timing, the counter would have ended up at 3, since whichever decrement occurred second would effectively have depended on whichever occurred first. Worse yet is the multi-instruction operation of “adding a task to the workpool”.
This requires a mechanism for performing atomic operations, ones that cannot be divided into separate operations or overlapped by dependent operations. Luckily, Intel and AMD provide an instruction prefix (the “lock” prefix) to make certain processor instructions atomic. With these, we can construct multi-instruction atomic operations, a workpool, any other thread control mechanism, or even a thread scheduler, which doesn’t have the luxury of having its own thread.
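In modern C++, those lock-prefixed instructions are exposed through std::atomic, so the dependency counter from the race described above can be decremented safely with a single atomic read-modify-write. A minimal sketch, with purely illustrative names:

```cpp
#include <atomic>

// The counter race from above, fixed with an atomic read-modify-write.
// On x86, std::atomic's fetch_sub compiles down to a lock-prefixed
// instruction, so two threads decrementing "at the same time" can never
// lose an update: 5 always ends up at 3.
std::atomic<int> dependenciesLeft{5};

bool finishOneDependency()
{
    // fetch_sub returns the value *before* the subtraction, so a return
    // value of 1 means this call took the counter to zero and the waiting
    // task can now be added to the workpool.
    return dependenciesLeft.fetch_sub(1) == 1;
}
```

Because the decrement and the read of the old value happen as one indivisible operation, exactly one thread sees the counter reach zero, so the waiting task gets added to the workpool exactly once.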
Nonetheless, with just a functioning workpool implementation, it’s possible to do amazing stuff in parallel. It very simply (i.e. 1 line of code and a bit of forethought) cuts down the post-processing time of AQUA@Home from over a day to a few hours on a Core i7 (8 logical CPU cores). Workpools of some form are also used for all parallelism in the AQUA@Home applications. With a careful way of selecting what to run next from a workpool, I can guarantee that one of the applications now spends at most ~10 seconds at the end without keeping all CPU cores 100% busy. I also parallelized assembling down to the function level in Inventor IDE, just by putting all the functions into a type of workpool.
I haven’t shown detail on a specific example here, but next time, I’ll walk through the parallelism of the main AQUA@Home application, presented in this paper I co-authored.