Vertex Assembly and Skinning

Last summer I wrote a blog post about the concept behind the shader system that we have in The Machinery. Most of what was described in that post still holds for what we have today, except for the anticipated rewrite of variation selection, which is now handled using the bitmask approach mentioned in that post.

When I came up with the tm_shader_system_o concept my primary goal was to solve problems like:

  • Generation and compilation of shader variations.
  • Runtime selection of which shader variation to use based on context.
  • Management of grouping of constants and resources with different update frequency needs.

But since then I’ve also discovered some new use cases where tm_shader_system_o acts like an abstract interface for data access, which from an engine architecture and shader authoring perspective feels interesting enough to put down in writing.

So today, we will take a closer look at that, or more precisely, at how we handle vertex assembly and skinning in The Machinery. I realize that these sound like solved problems, but I wanted to take a few steps back and investigate whether there are alternative approaches that give more flexibility and generate shader code that is easier to maintain, compared to relying on the Input-Assembler stage (IA) doing it for us.

Vertex Assembly

In The Machinery we don’t use classic vertex buffers for retrieving vertex data. Instead, we load all data explicitly from ByteAddressBuffers. From a shader authoring point of view you rarely have to think about how the various vertex channels (like position, texture coordinates, normals, etc.) find their way to the shader, as that code is hidden in a tm_shader_system_o that exposes a generic interface for retrieving the data. It looks something like this:

tm_vertex_loader_context ctx;
init_vertex_loader_context(ctx, input.instance_id);

float4 local_pos = load_position(ctx, input.vertex_id, 0);
float2 uv = load_texcoord0(ctx, input.vertex_id, 0); 

Exactly what happens inside the load_*() functions depends on what the activated tm_shader_system_o providing the implementations decides to do. While it could be as simple as loading the data from one or many ByteAddressBuffers, it could just as well procedurally generate the result based on some metadata looked up from the instance_id and vertex_id.

With this abstract interface approach for loading vertex data we get a lot of power and flexibility:

  • We can freely load the data for any vertex of interest, not only the one we are currently processing.

  • We can expose reading from the index buffer, allowing access to the full mesh data from any shader stage.

  • We can encode, quantize or compress vertex data in any way that makes sense for the underlying engine system.

  • We can gracefully handle cases of missing vertex data (e.g. if the shader tries to load data for a certain uv-set that the underlying mesh data does not contain).

  • We can access any number of vertex data sets to easily support stuff like morph targets or similar.

This drastically simplifies life for regular shader authors, who mostly care about surface appearance. They can now write a single shader, knowing that it will work whatever engine object it gets assigned to.

Let’s take a quick peek at what one of the most basic Vertex Assembly interfaces might look like: a replacement for the IA that loads vertex data explicitly from buffers:

// ...

struct tm_vertex_loader_context {
    // Bitflag indicating which vertex channels are active (upper 16 bits unused).
    uint active_channels;
    // Total number of vertices per set in the vertex buffer.
    uint num_vertices;
    // Byte offset per channel (array size of 16 inferred from the 16 usable bits above).
    uint offsets[16];
    // Byte stride per channel.
    uint strides[16];
};

bool has_channel(tm_vertex_loader_context ctx, uint semantic) {
    return (ctx.active_channels & (1u << semantic)) != 0;
}

float4 load_position(tm_vertex_loader_context ctx, uint vertex_id, uint set) {
    ByteAddressBuffer buf = get_vertex_buffer_position_buffer();

    uint offset = (set * ctx.num_vertices + vertex_id) *
        ctx.strides[VERTEX_SEMANTIC_POSITION] +
        ctx.offsets[VERTEX_SEMANTIC_POSITION];

    // Load3() returns raw uints, so reinterpret the bits as floats.
    return has_channel(ctx, VERTEX_SEMANTIC_POSITION) ?
        float4(asfloat(buf.Load3(offset)), 1) : float4(0,0,0,1);
}

// ... the rest of the load_*() functions ...

The above code feels rather self-explanatory so I won’t spend any time going through it. Instead, we’ll move on and take a look at our implementation of mesh skinning in The Machinery.
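For readers who want to check the addressing on the CPU side, here is a minimal sketch in C that mirrors the offset calculation and the missing-channel fallback of load_position(). It is not code from The Machinery: the vertex_layout_t type, the SEMANTIC_* values, the helper names, and the 16-entry arrays are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical semantic indices; the real values in The Machinery may differ.
enum { SEMANTIC_POSITION = 0, SEMANTIC_TEXCOORD0 = 1, MAX_CHANNELS = 16 };

// Assumed CPU-side mirror of tm_vertex_loader_context.
typedef struct vertex_layout_t {
    uint32_t active_channels;       // Bit i set => channel i is present.
    uint32_t num_vertices;          // Number of vertices per set.
    uint32_t offsets[MAX_CHANNELS]; // Byte offset of each channel.
    uint32_t strides[MAX_CHANNELS]; // Byte stride of each channel.
} vertex_layout_t;

static bool layout_has_channel(const vertex_layout_t *l, uint32_t semantic)
{
    return (l->active_channels & (1u << semantic)) != 0;
}

// Byte address of one element of a channel, for a given vertex and set.
static uint32_t channel_address(const vertex_layout_t *l, uint32_t semantic,
                                uint32_t vertex_id, uint32_t set)
{
    return (set * l->num_vertices + vertex_id) * l->strides[semantic] +
           l->offsets[semantic];
}

// Reads a position as (x, y, z, 1), or (0, 0, 0, 1) if the channel is missing.
static void read_position(const vertex_layout_t *l, const uint8_t *buf,
                          uint32_t vertex_id, uint32_t set, float out[4])
{
    out[0] = out[1] = out[2] = 0.0f;
    out[3] = 1.0f;
    if (layout_has_channel(l, SEMANTIC_POSITION)) {
        const uint32_t addr = channel_address(l, SEMANTIC_POSITION, vertex_id, set);
        memcpy(out, buf + addr, 3 * sizeof(float));
    }
}
```

The fallback branch is what lets a shader ask for a channel the mesh does not have and still get a well-defined value back.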


Skinning

Mesh skinning is typically implemented by associating each vertex with a fixed number of bone influences that transform the vertex from its “bind” position to the skinned position. This is usually handled by storing, for each influencing bone, an index into an array of bone matrices and a weight in the vertex data.

In The Machinery I use a somewhat different approach that is less restricted when it comes to the number of bones each vertex can be influenced by.

Instead of encoding bone indices and weights directly in the vertex data I store a single uint that I refer to as skin_data. In its upper 8 bits I store the number of influencing bones for the vertex and in the lower 24 bits I store a buffer offset to the beginning of an array holding:

struct tm_bone_influence_t {
    uint index;
    float weight;
};

This allows us to support a varying number of bone influences per vertex (anywhere from 0 to 255 bones) without wasting any space, at the price of a memory indirection.
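The bit layout of skin_data can be sketched on the CPU side as follows. The helper names are hypothetical, and I am assuming the 24-bit offset is expressed in bytes, which is how the shader code in this post uses it.

```c
#include <stdint.h>

// Hypothetical helper: packs an influence count (upper 8 bits) and a byte
// offset (lower 24 bits) into a single 32-bit skin_data word.
static uint32_t pack_skin_data(uint32_t num_influences, uint32_t byte_offset)
{
    return ((num_influences & 0xffu) << 24) | (byte_offset & 0xffffffu);
}

// Number of bones influencing the vertex (0 to 255).
static uint32_t skin_data_num_influences(uint32_t skin_data)
{
    return skin_data >> 24;
}

// Byte offset to the vertex's first influence entry.
static uint32_t skin_data_offset(uint32_t skin_data)
{
    return skin_data & 0xffffffu;
}
```

Under that assumption, 24 bits address 16 MiB of influence data, which at 8 bytes per influence is room for about two million influences per buffer.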

As for selecting the correct shader variation that applies the actual skinning transforms, that is handled by activating a tm_shader_system_o called skinning_system on the engine side. The skinning_system exposes an array of bone matrices to the shader that is indexed by tm_bone_influence_t::index when looping over the influences.

As we rely on having per-pixel velocity available for TAA and other post effects, the bone matrices array contains transform matrices for both current and last frame. To reduce unnecessary memory shuffling we use a simple ping-pong scheme when updating the matrices:

Frame 0: w([f0-mat0, f0-mat1, f0-mat2, ...]) u([unused..])
Frame 1: u([f0-mat0, f0-mat1, f0-mat2, ...]) w([f1-mat0, f1-mat1, f1-mat2,...])
Frame 2: w([f2-mat0, f2-mat1, f2-mat2, ...]) u([f1-mat0, f1-mat1, f1-mat2,...])

w() -- written this frame
u() -- untouched from last frame

A uint2 constant declared in the skinning_system is updated every frame with the byte offsets to the start of the current and last frame bone arrays.
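One way to derive those two offsets is sketched below. This is an illustration under assumptions, not the engine's code: it assumes the two halves are laid out back to back in one buffer, that each matrix is a 64-byte 4x4 float matrix, and that the frame counter's lowest bit picks the half to write. The bone_offsets_t type and function name are hypothetical.

```c
#include <stdint.h>

// Hypothetical mirror of the uint2 constant: byte offsets to the bone arrays.
typedef struct bone_offsets_t {
    uint32_t current; // Start of the array written this frame.
    uint32_t last;    // Start of the array left untouched from last frame.
} bone_offsets_t;

// Even frames write the first half of the buffer, odd frames the second half.
static bone_offsets_t bone_offsets_for_frame(uint64_t frame, uint32_t num_bones)
{
    const uint32_t half_size = num_bones * 64u; // 64 bytes per 4x4 float matrix.
    const uint32_t write_half = (uint32_t)(frame & 1u);
    bone_offsets_t offsets;
    offsets.current = write_half * half_size;
    offsets.last = (write_half ^ 1u) * half_size;
    return offsets;
}
```

Note that on the very first frame the "last" half has never been written, so a real implementation would need to handle that case, for example by pointing both offsets at the current array.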

The final shader code for skinning becomes very straightforward, something like:

struct tm_skin_header {
    uint num_influences;
    uint address;
};

void init_skin_header(tm_vertex_loader_context ctx, inout tm_skin_header skin_header,
    uint vertex_id)
{
    if (has_channel(ctx, VERTEX_SEMANTIC_SKIN_DATA)) {
        uint offset = vertex_id * ctx.strides[VERTEX_SEMANTIC_SKIN_DATA] +
            ctx.offsets[VERTEX_SEMANTIC_SKIN_DATA];
        uint skin_data = get_vertex_buffer_skin_data_buffer().Load(offset);
        // Upper 8 bits: influence count. Lower 24 bits: offset to the influences.
        skin_header.num_influences = (skin_data >> 24) & 0xff;
        skin_header.address = skin_data & 0xffffff;
    } else {
        skin_header.num_influences = 0;
        skin_header.address = 0;
    }
}

float4 skin_point(tm_skin_header skin_header, in float4 p, uint bone_offset,
    ByteAddressBuffer bones)
{
    ByteAddressBuffer buf = get_vertex_buffer_skin_data_buffer();
    float4 res = float4(0,0,0,1);
    for (uint i=0; i!= skin_header.num_influences; ++i) {
        // Each influence is 8 bytes: a uint bone index followed by a float weight.
        uint2 bone_and_weight = buf.Load2(skin_header.address + i * 8);
        // Accumulate xyz only, so that w stays 1 when the weights sum to 1.
        res.xyz += asfloat(bone_and_weight.y) *
            mul(p, load_mat44(bones, bone_offset + bone_and_weight.x * 64)).xyz;
    }
    return res;
}
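As a way to sanity-check the blending on the CPU, here is a small reference sketch in C. It is my own illustration, not engine code: it assumes row-major 4x4 matrices with the translation in the last row (matching the row-vector mul(p, M) convention), and it accumulates only xyz while keeping w at 1, which is a choice of this sketch. All names are hypothetical.

```c
#include <stdint.h>

// Hypothetical CPU-side mirror of tm_bone_influence_t.
typedef struct bone_influence_t {
    uint32_t index;
    float weight;
} bone_influence_t;

// Row vector times row-major 4x4 matrix: out[c] = sum over r of p[r] * m[r][c].
static void mul_vec_mat(const float p[4], const float m[16], float out[4])
{
    for (int c = 0; c < 4; ++c)
        out[c] = p[0] * m[c] + p[1] * m[4 + c] + p[2] * m[8 + c] + p[3] * m[12 + c];
}

// Blends p through every influencing bone. `bones` holds 16 floats per matrix.
// The result is (x, y, z, 1); with zero influences it is (0, 0, 0, 1).
static void skin_point_ref(const bone_influence_t *influences, uint32_t num_influences,
                           const float *bones, const float p[4], float out[4])
{
    out[0] = out[1] = out[2] = 0.0f;
    out[3] = 1.0f;
    for (uint32_t i = 0; i != num_influences; ++i) {
        float transformed[4];
        mul_vec_mat(p, bones + influences[i].index * 16, transformed);
        for (int c = 0; c < 3; ++c)
            out[c] += influences[i].weight * transformed[c];
    }
}
```

Because the loop bound comes from the per-vertex header, a vertex with two influences costs two matrix loads and one with eight costs eight, with no padding in either case.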

Component decoupling using tm_component_shader_system_i

Before I wrap up, I’d like to touch a bit on engine system decoupling. In The Machinery we use an Entity Component System (ECS), and, as there is no monolithic core in The Machinery, everything can be extended through plugins. In the ECS context that typically means that plugins expose new types of components.

A recurring situation when it comes to rendering in an ECS is that you have one component responsible for issuing some kind of graphics work in the form of draw or dispatch calls, and another component wanting to expose some kind of data to the shader assigned to those draw or dispatch calls. But since the two components might be implemented in two completely different plugins, we need this to work without the two knowing about each other.

A concrete example is the Skinning Component exposing the previously mentioned tm_shader_system_o called skinning_system, to a shader used by a draw call issued from the Render Component. Also, we don’t want to compute the bone transforms (going from bind pose to skinned pose) unless the Render Component is in view from at least one of the cameras viewing the entity.

To handle this situation, the plugin author wanting to expose a tm_shader_system_o can implement an interface called tm_component_shader_system_i and register it under a name in our API Registry. This way the Render Component (and others) can enumerate all registered tm_component_shader_system_i implementations and call the update callback of each only if the component survived view frustum culling for at least one camera.

The update callback returns a tm_component_shader_system_t that looks like this:

typedef struct tm_component_shader_system_t {
    // Bitmask describing for which viewers the system should be activated.
    uint64_t active_viewer_mask;
    // Shader system to enable.
    struct tm_shader_system_o *system;
    // Optional constant buffer needed by shader system.
    struct tm_shader_constant_buffer_instance_t *constants;
    // Optional resource binder needed by shader system.
    struct tm_shader_resource_binder_instance_t *resources;
} tm_component_shader_system_t;   

The active_viewer_mask allows the component to decide for which viewer types (regular viewport cameras, shadow mapping cameras, etc) the system should be activated.
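To illustrate how such a mask might be consumed, here is a minimal sketch in C. How bits are assigned to viewers is not described in this post, so the viewer indices, the one-bit-per-viewer layout, and the helper names below are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical viewer indices; the actual bit assignment is not specified here.
enum { VIEWER_MAIN_CAMERA = 0, VIEWER_SHADOW_CASCADE_0 = 1 };

// True if the shader system should be activated for the given viewer.
static bool system_active_for_viewer(uint64_t active_viewer_mask, uint32_t viewer_index)
{
    return viewer_index < 64 && ((active_viewer_mask >> viewer_index) & 1u) != 0;
}

// Viewers that both see the component and want the system activated.
static uint64_t viewers_to_bind(uint64_t active_viewer_mask, uint64_t visible_viewer_mask)
{
    return active_viewer_mask & visible_viewer_mask;
}
```

Combining the mask with a per-viewer visibility mask in this way would let the caller skip binding the system's constants and resources for viewers that never see the component.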

So far, this looks like a fairly promising and efficient way of handling component-to-component communication without coupling when it comes to rendering. In case you missed it, I blogged about ECS and rendering last autumn, although in that post I mainly focused on exposing viewer- and scene-global data to lots of draw and dispatch calls.