It’s All About The Data

In today’s post I will try to give a few tips on how to structure your cross-platform rendering APIs to make it easier to achieve data parallelism when building command buffers for various graphics APIs.

This post will assume the input to the render backend (i.e. the system that is responsible for generating the final graphics API specific command buffers) will be some kind of graphics API agnostic command buffers, similar to what I described in “A modern rendering architecture”.

A big hurdle when trying to parallelize the translation of these graphics API agnostic command buffers into actual graphics API command buffers is not knowing enough about the graphics pipeline state at the job split points. Here’s a simple illustration of the problem:

^                  ^                  ^                 ^                  ^
|                  |                  |                 |                  |
job0               job1               job2              job3               job4

In the above, much simplified, example we have a total of 13 commands in our graphics API agnostic command buffer, and we’d like to go wide and let each job translate 3 commands into graphics API calls (the last job picks up the single remaining command). For job0 everything is fine, as we start with a clean slate and know the state of the graphics pipeline (i.e. what render targets, shaders, etc. are bound). But for job1 we somehow need to inherit the accumulated state changes from job0 before we can translate any commands, for job2 we need to inherit the accumulated state changes from both job0 and job1, and so on.

This typically leads to having to do a serial scanning pass over all commands to accumulate pipeline state changes and record the starting pipeline state for each job before we can launch the jobs. While this won’t be a problem if you only have a handful of commands, it might start to hurt performance when you’re dealing with tens of thousands of commands.
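To make the serial pass concrete, here is a minimal sketch of what it could look like. The command types, the `pipeline_state_t` struct and the function name are all made up for illustration; a real backend tracks far more state than two handles.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical command types; only the first two affect pipeline state.
enum { CMD_BIND_RENDER_PASS, CMD_BIND_SHADER, CMD_DRAW };

// Hypothetical accumulated pipeline state: the handles that were last bound.
typedef struct pipeline_state_t {
    uint32_t render_pass;
    uint32_t shader;
} pipeline_state_t;

// Hypothetical command: for state commands `handle` is the object to bind.
typedef struct command_t {
    uint32_t type;
    uint32_t handle;
} command_t;

// Serial pass: walks all commands once and records, for every job, the state
// that is active just before that job's first command.
// Job j translates commands [j * per_job, min((j + 1) * per_job, n)).
static void record_job_start_states(const command_t *cmds, uint32_t n, uint32_t per_job,
    pipeline_state_t *job_states, uint32_t num_jobs)
{
    assert(per_job > 0 && num_jobs == (n + per_job - 1) / per_job);
    pipeline_state_t state = { 0, 0 }; // clean slate for the first job
    for (uint32_t i = 0; i < n; ++i) {
        if (i % per_job == 0)
            job_states[i / per_job] = state;
        if (cmds[i].type == CMD_BIND_RENDER_PASS)
            state.render_pass = cmds[i].handle;
        else if (cmds[i].type == CMD_BIND_SHADER)
            state.shader = cmds[i].handle;
    }
}
```

With 13 commands and 3 commands per job this fills in five start states, and each job can then translate its own range without looking at any other job.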

While I don’t have an elegant solution to this problem, I do have a few tips and tricks for how to make this serial scanning pass fairly efficient and less cumbersome to implement.

1. Separate command header from command data

Keep the command header as lightweight as possible, separated from the command data. Here’s an example of what that might look like:

struct tm_renderer_command_t
{
    uint64_t sort_key;
    uint32_t type;
    void *data;
};

The sort_key is used for ordering commands across one or multiple graphics API agnostic command buffers. type is an enum describing the command type and defines how to interpret data.

By decoupling the header from the actual data for the command we can significantly reduce the amount of memory that we’ll have to touch during the state scanning pass. Our current data structure for representing a draw call is 60 bytes, and while that is a somewhat beefy command, it’s uncommon to have command data smaller than 8 bytes. If that still happens, there’s nothing stopping us from just shoving the command data directly into void *data.
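As a small illustration of that last trick, a payload that fits in the pointer can be stored in the pointer value itself. The helper names below are hypothetical, and the round trip through uintptr_t is implementation-defined in C, although it behaves as expected on the mainstream platforms I know of.

```c
#include <assert.h>
#include <stdint.h>

// Stores a small payload in the pointer value itself instead of allocating
// separate command data for it.
static inline void *pack_small_payload(uint32_t value)
{
    return (void *)(uintptr_t)value;
}

// Reads the payload back. Must only be used for commands whose type says the
// data pointer holds an inline value rather than an address.
static inline uint32_t unpack_small_payload(void *data)
{
    assert((uintptr_t)data <= UINT32_MAX);
    return (uint32_t)(uintptr_t)data;
}
```

The command type has to tell the backend which interpretation applies, since nothing in the pointer itself distinguishes an inline value from a real address.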

An important observation is that the commands that affect pipeline state will typically be far fewer than e.g. draw commands, so our goal is mainly to find them as quickly as possible. In practice that means looping over a huge number of tm_renderer_command_t, looking at each type enum and deciding whether it’s a command that will affect the pipeline state or not. Most of the time the answer will be no. So with that in mind, can we do better?

An obvious improvement to the above suggested command header layout is to move from an AoS (Array of Structures) memory layout to SoA (Structure of Arrays) memory layout:

struct tm_renderer_commands_t
{
    uint64_t *sort_keys;
    uint32_t *types;
    void **data;
};

Data oriented design 101 — since we are only interested in the type enum to decide if the command is relevant or not this rewrite will significantly speed up the loop as we will drastically decrease the number of cache misses.
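A rough sketch of the scanning loop over the SoA layout could look like this. The command type enum and the convention of grouping state-affecting types first are assumptions made for the example, not something the layout above prescribes.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical command types. State-affecting types are grouped first so that
// a single comparison classifies a command.
enum {
    CMD_BIND_RENDER_PASS,
    CMD_BIND_SHADER,
    CMD_FIRST_NON_STATE,
    CMD_DRAW = CMD_FIRST_NON_STATE,
    CMD_DISPATCH
};

typedef struct tm_renderer_commands_t {
    uint64_t *sort_keys;
    uint32_t *types;
    void **data;
} tm_renderer_commands_t;

// Collects the indices of all state-affecting commands into `out` and returns
// how many were found. Only the densely packed `types` array is read.
static uint32_t find_state_commands(const tm_renderer_commands_t *c, uint32_t n,
    uint32_t *out, uint32_t capacity)
{
    uint32_t count = 0;
    for (uint32_t i = 0; i < n; ++i) {
        if (c->types[i] < CMD_FIRST_NON_STATE) {
            assert(count < capacity);
            out[count++] = i;
        }
    }
    return count;
}
```

Every cache line fetched by this loop holds 16 types (assuming 64-byte cache lines), where the AoS layout would fit fewer than three full headers.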

To improve on this even further we record a separate bit array with one bit per command that only describes whether the command is relevant for pipeline state tracking or not. By doing that we can also scan 64 commands per iteration of the loop.

As an example, let’s say we have 20,000 commands that we need to scan. Using the bit array approach we only have to look at 2,500 bytes. This should be compared to 80,000 bytes looking at the full type (uint32_t) in the above SoA layout, or 480,000 bytes (!) for the original AoS memory layout.
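Here is a minimal sketch of the bit array approach, assuming the bit is set while the command is recorded. The function names are hypothetical, and the portable count-trailing-zeros loop stands in for whatever intrinsic your compiler provides.

```c
#include <assert.h>
#include <stdint.h>

// Marks command `i` as relevant for state tracking. Called while recording.
static void mark_state_command(uint64_t *bits, uint32_t i)
{
    bits[i / 64] |= (uint64_t)1 << (i % 64);
}

// Portable count-trailing-zeros for a non-zero word. Compilers offer faster
// intrinsics; this loop keeps the sketch in standard C.
static uint32_t count_trailing_zeros(uint64_t word)
{
    assert(word != 0);
    uint32_t n = 0;
    while ((word & 1) == 0) {
        word >>= 1;
        ++n;
    }
    return n;
}

// Writes the index of every marked command to `out` (which must have room for
// all of them) and returns the count. Each outer iteration tests 64 commands.
static uint32_t scan_state_commands(const uint64_t *bits, uint32_t num_commands, uint32_t *out)
{
    uint32_t count = 0;
    const uint32_t num_words = (num_commands + 63) / 64;
    for (uint32_t w = 0; w < num_words; ++w) {
        uint64_t word = bits[w];
        while (word) {
            out[count++] = w * 64 + count_trailing_zeros(word);
            word &= word - 1; // clear the lowest set bit
        }
    }
    return count;
}
```

For the 20,000 commands in the example the outer loop runs 313 times, and any word that is zero (64 commands with no state changes) is skipped with a single compare.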

Point being — think very carefully about your memory layouts!

2. Keep state commands as few and as monolithic as possible

As mentioned earlier, the bulk of your commands will most likely be draw commands. By keeping the pipeline state commands few and monolithic we can further speed up the state scanning pass and simplify our code. A couple of concrete examples:

  1. Avoid binding render targets slot-by-slot using separate commands. Instead bind the full MRT-chain and depth stencil target in one command. Or move even further and do what I’m currently doing in The Machinery — bind a full “Render Pass”.

    A “Render Pass” in this context is the full MRT-chain and depth stencil target, but in addition to that the user also has to provide the initial and final resource state of each render target and how to deal with loading and storing the contents of each target.

    While some might argue that this puts too much burden on the user, who has to provide all this information just to bind a simple render target, I would argue that it’s better to build a high-level “convenience system” on top of the rendering API to assist with this than to go down the state tracking rabbit hole. (See my post on “High-Level Rendering Using Render Graphs”.)

  2. Make the smallest representation of a “shader” something that groups not only all active shader stages, but also as much render state as possible. Try to map your graphics API agnostic representation of a shader decently close to a full “Pipeline State Object” (see Vulkan, DX12). This can be a somewhat tricky balance to get right, though: if you move too far into PSO-land you will end up forcing your users to write a bunch of boilerplate code to do even the simplest of things.

  3. Whatever you do, do not under any circumstances subdivide a state into multiple commands that might be changed with varying frequency, i.e. don’t mimic DX9-style SetRenderState() as a command. It’s definitely going to complicate your life. If you have lightweight states that might vary frequently, you are much better off coupling those with your draw commands than spending more time than necessary in the state tracking rabbit hole.
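To give a feel for what the three points above could look like as data, here is a rough sketch. All struct, enum and field names are invented for this example and are not the actual structures used in The Machinery.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_COLOR_TARGETS = 8 };

// Hypothetical load/store actions and resource states.
typedef enum { LOAD_OP_LOAD, LOAD_OP_CLEAR, LOAD_OP_DISCARD } load_op_t;
typedef enum { STORE_OP_STORE, STORE_OP_DISCARD } store_op_t;
typedef enum { STATE_RENDER_TARGET, STATE_SHADER_READ } resource_state_t;

typedef struct attachment_t {
    uint32_t target;
    resource_state_t initial_state, final_state;
    load_op_t load_op;
    store_op_t store_op;
} attachment_t;

// Point 1: one monolithic command binds the full MRT-chain and depth stencil.
typedef struct bind_render_pass_t {
    uint32_t num_color_targets;
    attachment_t color[MAX_COLOR_TARGETS];
    attachment_t depth_stencil;
} bind_render_pass_t;

// Points 2 and 3: the shader handle groups all stages plus most render state,
// and lightweight, frequently changing state rides along with the draw.
typedef struct draw_t {
    uint32_t shader;
    uint32_t first_vertex, num_vertices;
    uint8_t stencil_ref;
} draw_t;

// With monolithic commands the scan needs no merging: the latest render pass
// command fully replaces whatever was bound before it.
static void track_render_pass(const bind_render_pass_t **current, const bind_render_pass_t *pass)
{
    assert(pass != 0 && pass->num_color_targets <= MAX_COLOR_TARGETS);
    *current = pass;
}
```

Compare this with slot-by-slot binding, where the scan would have to merge every individual slot change into the tracked state before it could tell a job what is bound.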

3. Explicit tracking of resource barriers

While our render backends can scan through the graphics API agnostic command buffers to figure out all places where we need to inject various types of resource barriers, doing so is far from trivial and rather time-consuming, as we need to analyze exactly how each writable GPU resource is bound to all draw and dispatch calls.

We could build some additional structures when building our graphics API agnostic command buffers to help accelerate the process, but even then it is likely that we’ll end up being over-conservative, injecting more resource barriers than we actually need.

There’s a perfectly good reason why resource state tracking moved out of the GPU driver and was handed over to the user of the graphics APIs instead: it’s a fucking PITA to do it efficiently. And that is exactly why our low-level render backend shouldn’t try to handle it either. While we might be in a somewhat better position to make assumptions about our input than the GPU driver, we will still end up having to do significant bookkeeping to sort out all barriers.

A better approach is to expose a resource barrier command in our rendering API and force the user of the API to explicitly specify and schedule all barriers. While it might sound like we are shuffling a lot of low-level responsibility over to the user of the API, keep in mind that most programmers won’t need to worry about this, as they will likely be building features on top of a more high-level system like the Render Graph system that I talked about a couple of posts ago.
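A barrier command in the API agnostic command buffer doesn’t need to be complicated. The sketch below is one possible shape for it; the state enum, the struct and the push function are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical, API agnostic resource states.
typedef enum {
    RESOURCE_STATE_RENDER_TARGET,
    RESOURCE_STATE_SHADER_READ,
    RESOURCE_STATE_COPY_SOURCE
} resource_state_t;

// An explicit barrier: the user states which resource transitions between
// which states, and the sort key schedules it relative to other commands.
typedef struct resource_barrier_command_t {
    uint64_t sort_key;
    uint32_t resource;
    resource_state_t before, after;
} resource_barrier_command_t;

// Appends a barrier to a caller-owned array and returns the new count.
static uint32_t push_resource_barrier(resource_barrier_command_t *cmds, uint32_t count,
    uint32_t capacity, uint64_t sort_key, uint32_t resource,
    resource_state_t before, resource_state_t after)
{
    assert(count < capacity);
    cmds[count].sort_key = sort_key;
    cmds[count].resource = resource;
    cmds[count].before = before;
    cmds[count].after = after;
    return count + 1;
}
```

The backend can then translate each of these more or less one-to-one into the barrier mechanism of the underlying graphics API, without analyzing how resources are bound to draw and dispatch calls.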

Wrap up

That’s it really. Right now I don’t think I have more stuff to say on this topic. If you follow the above advice and think carefully about when and how to schedule resource updates (see my last post), you should be in a good state when it comes to parallelizing your renderer without making your code overly complex.