Table of Contents
- Multi-threading Advantages
- How to do it Wrong (Dedicated Subsystem Threads)
- How to do it Right (Jobs and Batches)
- Further Considerations
Just a few years ago, we had it easy. Optimizing a game engine was all about low-level code. Machine optimizations. Instruction counts. Black magic mathematics. As technology rolled on, however, things got more complicated. CPUs got faster while memory stayed slow, invalidating many optimization techniques relying on tables and pre-computed values while new optimizations involving memory access patterns became critical to high performance applications. Deep CPU pipelines meant that branching into different code paths optimized for highly specific cases could be slower than a single generic code path that had no branches at all. The raw throughput of the modern CPU simply made many optimizations unnecessary while the increasing latency in pathological cases made yesterday’s unnecessary optimization’s into today’s top priority hot spots. Then the multi-core CPUs fell into consumers’ hands. At first it was the dual-core CPUs, many of which were simply two single-core CPUs bundled onto a single die. Then the quad-cores, and the hexa-cores, and now we’ve even got duodeci-cores. We’re in a world where much if not most of our code could be rewritten as highly specialized math kernels and executed in blindingly-fast speeds on a GPU and yet most programmers still aren’t sure how to make the most of the dual-core CPUs that are half a decade old by now.
There are essentially two reasons to use multithreading. The oldest reason, which has been confounding programmers for decades, is to get better latency behavior out of a system. If the application is processing some data then it can’t respond to other events. This is still very often seen in desktop applications that become temporarily unresponsive while processing data. Unfortunately, most programmers never really grasped the intricacies of multi-threaded programming. Many who tried simply did it horrifically wrong, often creating far more threads than necessary and bogging down the total system throughput without actually minimizing latencies in the system.
The second reason to use multithreading – and the most important for less latency-sensitive applications – is that multithreading is literally the only way to squeeze out the maximum performance on modern CPUs. CPUs aren’t getting much faster in terms of raw single-threaded performance. Instead, CPUs are becoming more and more parallel. Even phones are starting to have dual-core CPUs in them, along with beefy GPUs. Single-threaded code is limited to running at one half to one twelfth or less of the CPU’s actual capabilities.
And then of course there are applications that care about both latency and about realizing the system’s maximum performance. Applications like video games, for instance.
How to do it Wrong (Dedicated Subsystem Threads)
There’s a Right Way and a Wrong Way to do just about everything. Unfortunately, the Wrong Way always seems to be more prevalent. That’s no different with game engines and multi-threading.
I’ve seen a lot of student games, community hobbyist games, and even professional AAA engines that use multi-threading on a per-subsystem basis. These engines create a thread for each of the main subsystems of the game – logic, physics, graphics, audio, and networking – and then synchronize updates between these threads each frame.
There are several problems with this approach. The first problem is that it simply does not scale. At best, the engine can use a number of CPU cores equal to the number of subsystems in the engine, which is generally around 4 or 5. The engine gets no extra benefit from CPUs with 6, 8, 12, or more cores. The parallelism of these engines is strictly limited and they are already dated and inefficient on CPUs in consumers’ hands today.
The second problem is that the synchronization between the threads is expensive. They can’t easily share data. The physics subsystem can’t work with data that is being modified by the logic subsystem, and the graphics subsystem can’t work with data that is being modified by the physics subsystem. Each subsystem needs to completely finish an update and atomically push those changes to the next stage in the engine’s pipeline. This either requires a tremendous amount of work and extra memory for Read-Copy-Update queues (essentially requiring a copy of all relevant data in memory for each subsystem) or they require that the threads work in lockstep (which makes the entire engine behave as if it were single-threaded).
Finally, the supposed benefits of the design just don’t make sense. You don’t need physics and graphics to be able to run independently. Each complete physics update should be rendered. Each rendered frame should be dependent on a physics update. If your physics subsystem is a bottleneck then allowing the graphics subsystem to independently run isn’t actually helping anything. You actually want the subsystems to run in lockstep, but that again implies single-threaded performance if the engine is modeled with one thread per subsystem.
There’s a better way to do things, thankfully.
How to do it Right (Jobs and Batches)
Multi-threading is not new to computer science. It’s become more important in recent years thanks to the popularity of multi-core CPUs in computer PCs, but we’ve been dealing with these issues for many years. The CS community has long since understood the power and efficiency of a batch processing thread pool system. These systems open a number of threads with no specific purpose. When some processing needs to be done, the software breaks the processing into independent batches, and submits these to a queue. The thread pool then pulls batches from the queue and distributes them to the individual threads.
The result is a system that can scale from a single core to thousands of cores, depending on the size of the data to be processed and the inherent parallelization capabilities of that data. The system has no inherent limit on the number of processing cores that it can scale to.
The problem with this approach is that it requires programmers to learn some new and relatively difficult to understand (for many people, at least) techniques. Humans don’t naturally parallelize their work, so wrapping one’s head around the idea of breaking a task up and processing those pieces simultaneously is a little alien. A lot of software and algorithms that are well known and understood need to be completely redesigned and rewritten from the ground up to work with a multi-threaded batch-processing system.
It’s worth it, however. The efficiency is ideal, and the engine will be able to scale up to newer systems with ease. Games built on the engine can add more and more content without requiring a complete rewrite of the engine.
The way it works for a game engine is fairly simple in concept. The main game loop continues to be a simple lock-step pipeline that executes each of the subsystems in logical order. Logic runs, then physics runs, then graphics runs, etc. Each subsystem is then responsible for breaking its workload into batches, and submitting these jobs to the thread pool. Then the subsystem just waits for all threads to complete their work, perform any post-processing necessary (which may entail creating a new set of batches for processing, if necessary), and continue.
One huge advantage this has over the other engine threading model is that there’s no need to implement complex and inefficient synchronization mechanisms between the subsystems. It works instead just like a single-threaded engine where each subsystem can directly operate on the game objects in memory because it knows nothing else is modifying them concurrently.
The complicated part is figuring out how to break up each subsystem’s work into batches. For physics updates, it’s ideal to break up all active objects into islands, so each island can be submitted as a single batch. This works out pretty well as there are usually groups of objects in near proximity to each other but which are not close to other groups of objects. For graphics, the primary CPU processing is often just the object culling and sorting, which is largely a read-only operation, and so objects can simply be submitted in fixed-size groups to the batch system. Logic updates can be done in several different ways, depending on the specific game design implemented. It takes some work, but the payoff is huge.
This is the right way to multi-thread a game engine. In fact, it’s usually the right way to multi-thread any software application, with relatively few exceptions.
One important point to keep in mind is that some parts of an engine need low-latency but aren’t particularly parallel in nature. For instance, networking. A single dedicated thread just for networking is probably the right way to go for most game engines, even if the rest of the engine is using a thread pool and batches for everything else. Audio may also be better suited to a dedicated thread, particularly if there isn’t a lot of CPU-side processing of audio streams going on.
A second thing to consider is that some types of applications – including some types of game software – should consider using separate processes instead of multi-threading. Using separate processes offers a number of advantages in the areas of reliability and management. Processes can die and be restarted without bringing down other processes. The strict separation of memory means that corruption or leaks in one process won’t affect the others. Processes can be started and stopped as needed. They can be given different priorities. One particularly interesting advantage of multiple processes is that they can be distributed across multiple servers, possibly even ones in different data centers. For games, a prime use of a multi-process architecture would be in game servers, particularly MMO servers.</div>