Vulkan: Command Buffer Management

Let’s pick up where we left off a few weeks ago and continue discussing some of the performance-critical parts of our Vulkan render backend. Today’s topic is the management of command buffers, and we’ll kick off with a quick overview of the user’s responsibilities when working with command buffers in Vulkan.

In Vulkan, any work to be conducted by the GPU is first recorded into command buffers (VkCommandBuffer) that are submitted for execution using vkQueueSubmit(). Command buffers are allocated from command pools (VkCommandPool). Pools are responsible for backing the command buffers with memory and are externally synchronized, meaning that the API user is responsible for making sure there are no concurrent accesses to a pool from multiple threads. This also applies when recording into command buffers allocated from the same pool, as they will request more memory from the pool when they need to grow.

When one or more command buffers are submitted for execution, the API user must guarantee not to free the command buffers, or any of the resources referenced in them, before they have been fully consumed by the GPU.

Practically, what this means is that each worker thread needs its own VkCommandPool to allocate command buffers from. It also means that we must carefully schedule destruction / recycling of command buffers, and of any resources referenced therein, so that it doesn’t happen before the command buffers have been fully consumed by the GPU. When submitting command buffers for execution we can ask Vulkan to signal a VkFence when all the command buffers associated with the submit have been consumed. When the fence is signaled, we know it’s safe to recycle or destroy any referenced resources as well as the buffers themselves.

In Vulkan there are two types of command buffers: primary and secondary. As this post isn’t intended to be a tutorial on Vulkan I won’t cover the difference between them in any detail, but the basic idea is that you will use primary command buffers to coordinate bigger state changes (like switching between render passes), while secondary command buffers will be used in your worker threads when generating lots of draw calls within a render pass. Secondary command buffers are then scheduled for execution by calling them from primary command buffers using vkCmdExecuteCommands().

To manage this in our Vulkan render backend we use a simple FIFO queue. For every vkQueueSubmit() we write a variable-length command to the queue with a header that looks like this:

typedef struct command_buffers_header_o
{
    // Which queue (graphics, compute, transfer) that this submit targets
    uint32_t queue;
    // Complete semaphore to (optionally) signal
    VkSemaphore complete_semaphore;
    // Fence to be signaled by Vulkan when the buffers have been fully consumed
    VkFence complete_fence;
    // Number of primary command buffers following this header
    uint32_t num_primary_buffers;
    // Number of secondary command buffers following this header
    uint32_t num_secondary_buffers;
    // Number of descriptor pools following this header
    uint32_t num_descriptor_pools;
    // Number of delete messages following this header
    uint32_t num_queued_deletes;
} command_buffers_header_o;

Immediately following the command header we pack the data associated with the command, so in memory the full command looks like this:

{
  [command_buffers_header_o]
  [num_primary_buffers * VkCommandBuffer]
  [num_secondary_buffers * VkCommandBuffer]
  [num_descriptor_pools * VkDescriptorPool]
  [num_queued_deletes * queued_resource_delete_t]
}
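Assuming the arrays are tightly packed right after the header, the total size of a command on the queue can be computed from the header alone. The sketch below uses stand-in handle typedefs so it compiles without vulkan.h, and abbreviates the delete-resource enum; `command_size` is an illustrative name, not the engine’s actual API:

```c
#include <stdint.h>
#include <stddef.h>

// Stand-ins for the Vulkan handle types, so the sketch is self-contained.
typedef struct VkCommandBuffer_T *VkCommandBuffer;
typedef struct VkDescriptorPool_T *VkDescriptorPool;
typedef struct VkSemaphore_T *VkSemaphore;
typedef struct VkFence_T *VkFence;

// Abbreviated from the full enum in the article.
typedef enum delete_resource_type { RESOURCE_BUFFER } delete_resource_type;

typedef struct queued_resource_delete_t {
    delete_resource_type type;
    void *resource;
} queued_resource_delete_t;

typedef struct command_buffers_header_o {
    uint32_t queue;
    VkSemaphore complete_semaphore;
    VkFence complete_fence;
    uint32_t num_primary_buffers;
    uint32_t num_secondary_buffers;
    uint32_t num_descriptor_pools;
    uint32_t num_queued_deletes;
} command_buffers_header_o;

// Total bytes occupied by a command on the FIFO queue: the header plus the
// variable-length arrays that follow it.
static size_t command_size(const command_buffers_header_o *h)
{
    return sizeof(*h)
        + (h->num_primary_buffers + h->num_secondary_buffers) * sizeof(VkCommandBuffer)
        + h->num_descriptor_pools * sizeof(VkDescriptorPool)
        + h->num_queued_deletes * sizeof(queued_resource_delete_t);
}
```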

The length of the final command depends on context. As described in “A Modern Rendering Architecture” we build two types of graphics-API-agnostic command buffers on the engine side: one that deals with creation, updating, and destruction of render backend resources, and another that describes actual work to be done by the render backend. So depending on what type of API-agnostic command buffers we are currently processing and how much work they contain, the length of the command put on the FIFO queue will differ.

The queued_resource_delete_t is just a container that references one or many Vulkan resources of a specific type, queued for destruction:

typedef enum delete_resource_type {
    RESOURCE_BUFFER,
    RESOURCE_IMAGE,
    RESOURCE_IMAGE_VIEW,
    RESOURCE_IMAGE_VIEWS,
    RESOURCE_SAMPLER,
    RESOURCE_MEMORY,
    RESOURCE_SHADER_MODULE
} delete_resource_type;

typedef struct queued_resource_delete_t
{
    delete_resource_type type;
    void *resource;
} queued_resource_delete_t;

Each frame we peek at the tail of the FIFO queue to determine if the VkFence in the command_buffers_header_o has been signaled. If so, we free any Vulkan resources in the queued_resource_delete_t array and recycle all primary and secondary VkCommandBuffers as well as all VkDescriptorPools. The recycling is done by simply copying their handles to designated recycle arrays within each physical device:

/* carray */ VkCommandBuffer *free_primary_command_buffers[QUEUE_MAX];
VkCommandPool worker_thread_command_pools[MAX_WORKER_THREADS][QUEUE_MAX];
/* carray */ VkCommandBuffer *free_secondary_command_buffers[MAX_WORKER_THREADS][QUEUE_MAX];
/* carray */ VkDescriptorPool *free_descriptor_pools;

Note: The /* carray */ comment in the code above indicates that the pointers point to our implementation of Sean Barrett’s “Stretchy Buffer”.

We then continue advancing the tail until the queue is either empty or we reach a VkFence that hasn’t yet been signaled.
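That retirement loop can be sketched as follows. To keep the sketch runnable without a live Vulkan device, the per-fence status (what vkGetFenceStatus() would report) is passed in as a plain bool array and the actual recycling work is elided; all names here are illustrative, not the engine’s actual code:

```c
#include <stdbool.h>
#include <stdint.h>

// Illustrative command record: in the real queue this would be the
// command_buffers_header_o followed by its packed arrays.
typedef struct gc_command_t {
    int fence_id; // stand-in for the VkFence in the header
} gc_command_t;

typedef struct gc_queue_t {
    const gc_command_t *commands;
    uint32_t count;
    uint32_t tail; // index of the oldest unconsumed command
} gc_queue_t;

// Advance the tail past every command whose fence has been signaled. Stops at
// the first unsignaled fence (preserving FIFO order) or when the queue is
// empty. signaled[i] plays the role of vkGetFenceStatus() for fence i.
// Returns how many commands were retired.
static uint32_t advance_tail(gc_queue_t *q, const bool *signaled)
{
    uint32_t retired = 0;
    while (q->tail < q->count && signaled[q->commands[q->tail].fence_id]) {
        // ... here the real backend would recycle the command buffers and
        // descriptor pools and process the queued deletes ...
        ++q->tail;
        ++retired;
    }
    return retired;
}
```

Note that a signaled fence further ahead in the queue does not let us skip past an earlier unsignaled one; commands are always retired in submission order.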

Currently we reset descriptor pools (vkResetDescriptorPool) and command buffers (vkResetCommandBuffer) on first use. Maybe it would be better to do that when we recycle them instead, as that would potentially give Vulkan more freedom to manage its memory efficiently. I don’t know. Nor do I know whether it is better to return the VkCommandBuffers to their respective pools instead of resetting them. As with a lot of things in Vulkan, there are not that many best-practice guidelines available, and things like this are also likely to be IHV-dependent.

The FIFO queue allocates its memory for the commands in blocks of 64 KB. Each time we want to put a new command on the queue, we iterate through any blocks already allocated (that we consider worth scanning, i.e. where the amount of free space is above a certain threshold) and grab the first block with enough free space to hold the command. If no existing block fits, we allocate a new one and put the command in that instead.
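A minimal sketch of that first-fit strategy, assuming a plain grow-by-realloc array of blocks; the names and the threshold value are illustrative, not the engine’s actual code:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

enum {
    BLOCK_SIZE = 64 * 1024,
    // Blocks with less free space than this aren't considered worth scanning.
    SCAN_THRESHOLD = 256,
};

typedef struct block_t {
    uint8_t data[BLOCK_SIZE];
    size_t used;
} block_t;

typedef struct block_list_t {
    block_t **blocks;
    size_t num_blocks;
} block_list_t;

// First-fit: return a pointer to `size` bytes of command storage, scanning
// existing blocks first and allocating a fresh 64 KB block on miss.
// (Error handling is mostly elided; this is a sketch, not production code.)
static void *alloc_command(block_list_t *list, size_t size)
{
    for (size_t i = 0; i < list->num_blocks; ++i) {
        block_t *b = list->blocks[i];
        const size_t free_space = BLOCK_SIZE - b->used;
        if (free_space < SCAN_THRESHOLD || free_space < size)
            continue;
        void *p = b->data + b->used;
        b->used += size;
        return p;
    }
    // No existing block fits: allocate a new one.
    block_t *b = calloc(1, sizeof(block_t));
    if (!b)
        return NULL;
    list->blocks = realloc(list->blocks, (list->num_blocks + 1) * sizeof(block_t *));
    list->blocks[list->num_blocks++] = b;
    b->used = size;
    return b->data;
}
```

Within a block, commands are simply bump-allocated; a block is only reclaimed as a whole, which fits the FIFO consumption pattern.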

A good alternative to the FIFO queue would be a ring buffer. My main argument for not using one is that we expect The Machinery to be used for a lot of different applications with very different content workloads, in which case tweaking the size of the ring buffer to avoid stalls can be a bit annoying. Also worth noting is that we don’t expect to see a lot of pressure on this queue, as there’s typically a rather low number of vkQueueSubmit() calls happening every frame, so performance isn’t too critical. With that said, it’s not unlikely that I will revisit this in the future.

So far I haven’t looked at adding support for pre-recorded secondary command buffers that can be reused across multiple frames. But from what I can tell it should be fairly trivial to support by simply treating them as regular backend resources (similar to images, buffers, samplers, etc.) and keeping them completely outside of this system.

That’s it. If you have comments, suggestions or questions, feel free to reach out to me on Twitter @tobias_persson. In my next post we’ll take a look at how we manage VkPipelines when rendering heavily in parallel.