Fiber-based job system

Ever since I saw Christian Gyrling’s very inspiring talk “Parallelizing the Naughty Dog Engine Using Fibers” at GDC 2015 I’ve wanted to try something similar. It has been a constant item on my TODO list for the last two years but somehow I’ve never found the time to do anything about it.

Last Monday we took the first steps of breaking down and starting to build the foundation library for our new project. We knew we wanted an efficient and easy-to-use job system, as our ambition is that all code will run as jobs; there shouldn’t be any special controller/main threads or similar. Everything after the initial system boot-up code should be executed and scheduled through the job system.

To achieve that, the job system has to be easy to use. Being able to yield in the middle of some code while waiting for another job to complete, and then resume execution without lots of manual bookkeeping, is key; fibers solve exactly that.

A fiber is essentially a lightweight cooperative thread, i.e. it never gets preempted by the OS. It is the fibers’ responsibility to yield themselves to run another fiber. The Windows SDK has built-in fiber support. On other platforms you can implement fibers through setjmp/longjmp or explicit stack-switching assembly.
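To make the cooperative-switching idea concrete, here is a minimal sketch using POSIX ucontext as a rough stand-in for the Windows fiber API (CreateFiber/SwitchToFiber). All names here are illustrative, not part of our system:

```c
// Sketch: cooperative switching with POSIX ucontext. The fiber runs until
// it explicitly yields back; nothing preempts it.
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];
int trace[3];
int trace_len = 0;

static void fiber_fn(void) {
    trace[trace_len++] = 2;              // runs second, after the first switch
    swapcontext(&fiber_ctx, &main_ctx);  // cooperative yield back to main
}

int run_fiber_demo(void) {
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
    fiber_ctx.uc_link = &main_ctx;       // where to go if fiber_fn returns
    makecontext(&fiber_ctx, fiber_fn, 0);

    trace[trace_len++] = 1;              // runs first
    swapcontext(&main_ctx, &fiber_ctx);  // switch into the fiber
    trace[trace_len++] = 3;              // resumes here after the fiber yields
    return trace_len;
}
```

The key point is that execution hops between stacks only at the explicit swapcontext calls, which is the same property that lets a job yield mid-function and be resumed later.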

There is lots of information about fibers on the internet so I won’t go into great detail in this post. (Christian gives a great introduction to them in his presentation, which is available for free on the GDC Vault.) Instead I wanted to show you what the current API for our new job system looks like and talk a bit about how it works on the inside:

struct tm_jobdecl_t
{
    void (*task)(void *data);
    void *data;
};

typedef struct tm_atomic_counter_t tm_atomic_counter_t;

struct tm_job_manager_i
{
    void (*run_jobs)(struct tm_jobdecl_t *jobs, uint32_t num_jobs, tm_atomic_counter_t **counter);
    void (*wait_for_counter)(tm_atomic_counter_t *counter, uint32_t value);
    // Use wait_for_counter_no_fiber when running jobs from a thread not part of the fiber system -- discuss: should this even be allowed?
    void (*wait_for_counter_no_fiber)(tm_atomic_counter_t *counter, uint32_t value);
};

struct tm_job_manager_i *tm_create_job_manager(struct tm_os_thread_i *thread_api, uint32_t num_worker_threads, uint32_t num_fibers, uint32_t fiber_stack_size);
void tm_destroy_job_manager(struct tm_os_thread_i *thread_api, struct tm_job_manager_i *job_manager);

From the user’s perspective there are only two functions to care about: run_jobs and wait_for_counter. The design of this API is almost identical to Naughty Dog’s design. run_jobs schedules a bunch of jobs to be executed and associates them with an atomic counter. The atomic counter gets initialized with the value of num_jobs. wait_for_counter waits for the atomic counter returned from a previous call to run_jobs to reach value. But instead of spin-locking while waiting, we yield the executing fiber, grab a new fiber from a pool of fibers, and continue to execute jobs from the job queue. The fiber we yielded from gets put on a queue of sleeping fibers together with its wait condition (atomic counter + wanted value).
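To illustrate the usage pattern, here is a sketch of how a caller might split work into tm_jobdecl_t jobs. Since the real run_jobs/wait_for_counter need the full system running, a plain serial loop stands in for the scheduler below; sum_range_job, sum_ctx_t, and parallel_sum are illustrative names, not part of the actual API:

```c
// Hypothetical example: summing an array by splitting it into ranged jobs.
#include <stdint.h>

typedef struct tm_jobdecl_t {
    void (*task)(void *data);
    void *data;
} tm_jobdecl_t;

typedef struct sum_ctx_t {
    const int *values;
    uint32_t begin, end;
    int64_t result;
} sum_ctx_t;

static void sum_range_job(void *data) {
    sum_ctx_t *ctx = data;
    int64_t sum = 0;
    for (uint32_t i = ctx->begin; i < ctx->end; ++i)
        sum += ctx->values[i];
    ctx->result = sum;
}

// Serial stand-in for run_jobs() + wait_for_counter(counter, 0).
// Assumes num_jobs <= 8 for simplicity.
int64_t parallel_sum(const int *values, uint32_t n, uint32_t num_jobs) {
    tm_jobdecl_t jobs[8];
    sum_ctx_t ctx[8];
    uint32_t chunk = (n + num_jobs - 1) / num_jobs;
    for (uint32_t j = 0; j < num_jobs; ++j) {
        uint32_t b = j * chunk;
        uint32_t e = b + chunk < n ? b + chunk : n;
        ctx[j] = (sum_ctx_t){ values, b, e, 0 };
        jobs[j] = (tm_jobdecl_t){ sum_range_job, &ctx[j] };
    }
    for (uint32_t j = 0; j < num_jobs; ++j)
        jobs[j].task(jobs[j].data);  // real code: run_jobs(jobs, num_jobs, &counter);
    int64_t total = 0;               // real code: wait_for_counter(counter, 0) first,
    for (uint32_t j = 0; j < num_jobs; ++j)
        total += ctx[j].result;      // then combine the per-job results
    return total;
}
```

The point is that each job carries its own data pointer, so fan-out/fan-in requires nothing beyond filling an array of declarations and waiting on the counter.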

In the main loop of the job system we first check if there are any sleeping fibers that can be swapped back in (i.e. if their wait condition has been met). If that’s the case, we put the currently executing fiber back on the fiber pool and yield to the sleeping fiber, resuming execution where we left off. If not, we check if there are any jobs available in the job queue and run their task function. When the task function returns, we decrement the atomic counter associated with the job.
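The counter bookkeeping in that loop could be sketched with C11 atomics like this. The post doesn’t show the actual layout of tm_atomic_counter_t, so wrapping a single atomic integer is an assumption, and the function names are illustrative:

```c
// Sketch of the counter operations the main loop relies on, assuming
// tm_atomic_counter_t wraps a single atomic integer.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct tm_atomic_counter_t {
    atomic_uint_fast32_t value;
} tm_atomic_counter_t;

// Called when run_jobs() enqueues num_jobs jobs against this counter.
void counter_init(tm_atomic_counter_t *c, uint32_t num_jobs) {
    atomic_store(&c->value, num_jobs);
}

// Called by a worker after a job's task() function returns.
void counter_decrement(tm_atomic_counter_t *c) {
    atomic_fetch_sub(&c->value, 1);
}

// The wake-up test the main loop runs against each sleeping fiber's
// condition (atomic counter + wanted value).
bool counter_condition_met(tm_atomic_counter_t *c, uint32_t wanted) {
    return atomic_load(&c->value) <= wanted;
}
```

With this shape, wait_for_counter(counter, 0) simply parks the fiber until every associated job has decremented the counter to zero.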

The current implementation of the job system is pretty slim; the entire system is less than 300 lines of code, and that includes two versions of a fixed-size multi-producer / multi-consumer FIFO queue. One version is my first attempt at writing a lockless queue; so far my adventures in lockless programming have failed miserably. (If anyone knows of reference code for a working lockless multi-producer / multi-consumer fixed-size FIFO queue, I’d be very interested.) The other version is a lock-based implementation (a spin lock over an atomic bool) that I put in just to unblock myself. I intend to go back and revisit the lock-free implementation as soon as I’ve gained some more knowledge on the subject.

High-performance FIFO queues are very important as they form the entire backbone for a system like this. Currently the system is built on four different queues:

  • job_queue – queue holding all jobs (a tm_jobdecl_t paired with a pointer to its atomic counter).
  • sleeping_fiber_queue – queue with all in-flight sleeping fibers (wake-up condition + fiber id).
  • free_fiber_indices – queue with indices of free / unused fibers in the fiber pool.
  • free_counter_indices – queue with indices of free / unused atomic counters in a pool.

All queues are fixed-size ring buffers. There are no memory allocations happening in the system; all memory is preallocated when the job system boots.
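A minimal sketch of the lock-based variant might look as follows: a fixed-size ring buffer guarded by a spin lock over an atomic flag. The names, the uint64_t item type, and the 64-entry capacity are all illustrative choices, not the actual implementation:

```c
// Sketch: fixed-size MPMC FIFO, spin lock over an atomic flag.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 64  // must be a power of two for the index mask below

typedef struct fifo_t {
    uint64_t items[QUEUE_SIZE];
    uint32_t head, tail;   // head == tail means empty; indices wrap via mask
    atomic_flag lock;
} fifo_t;

void fifo_init(fifo_t *q) {
    q->head = q->tail = 0;
    atomic_flag_clear(&q->lock);
}

static void fifo_lock(fifo_t *q)   { while (atomic_flag_test_and_set(&q->lock)) { /* spin */ } }
static void fifo_unlock(fifo_t *q) { atomic_flag_clear(&q->lock); }

bool fifo_push(fifo_t *q, uint64_t item) {
    fifo_lock(q);
    bool ok = q->tail - q->head < QUEUE_SIZE;        // not full
    if (ok)
        q->items[q->tail++ & (QUEUE_SIZE - 1)] = item;
    fifo_unlock(q);
    return ok;
}

bool fifo_pop(fifo_t *q, uint64_t *item) {
    fifo_lock(q);
    bool ok = q->head != q->tail;                    // not empty
    if (ok)
        *item = q->items[q->head++ & (QUEUE_SIZE - 1)];
    fifo_unlock(q);
    return ok;
}
```

Because head and tail only ever increase (unsigned wraparound is fine), the full/empty tests stay simple, and everything lives in preallocated storage, matching the no-allocation constraint above.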

For now I’ve tried to keep the system as simple as possible and avoided supporting multiple job queues with different priorities, fibers with different stack sizes, and similar. As we start using the system I anticipate that the current design will be revisited; hopefully, though, the system can remain lightweight and easy to get an overview of.

When the system has matured a bit I will write a follow up post about our experiences of using a fiber based job system as our core threading model in The Machinery.

Over and out.