The Machinery Shader System (part 2)

It’s time for the second part in my series of posts covering the shader system in The Machinery. In the wrap-up of part 1 I said I would pick up and describe something I referred to as the “shader system IO”. But that was six weeks ago, and as with most things in software development, predictions of the future tend to fail; that statement was no exception.

Turns out there are a lot of other, more low-level things to get out of the way first before it makes sense to talk about abstractions for handling generation of shader variations, or strategies for dealing with varying update frequencies of constants and resources.

Shader Declarations

So today we will instead begin by looking at an API that I call tm_shader_declaration_api, which exposes an interface for declaring:

  • Shader code.
  • Stage-to-stage linking information.
  • Inputs and outputs in the form of resources and constants.
  • State blocks.

The data is stored in an object called tm_shader_declaration_o and all data is optional. The final high-level representation of a shader is assembled by combining multiple of these declaration objects together. At the moment this is done by simply stacking a number of declarations on top of each other with a set of special merge rules, but in the future I imagine that some of these declarations will represent nodes exposed to a graph editor, and the actual combination will then become more of a link step driven by an artist-authored graph.

On top of the API there’s also a JSON front-end, providing a data-driven way to populate the tm_shader_declaration_o. While the API provides an identical feature set, it feels easier to get a decent overview of what all this means in practice by looking at some JSON, so let’s do that.

depth_stencil_states : {
    depth_test_enable : false
    depth_write_enable : false
    depth_compare_op : "greater"
}

raster_states : {
    polygon_mode: "fill"
}

samplers : {
    clamp_point : {
        min_filter: "point"
        max_filter: "point"
        mip_mode: "point"
        address_u: "clamp"
        address_v: "clamp"
        address_w: "clamp"        
    }
}

imports : [
    { name: "texture" type: "texture_2d" }
    { name: "near_far" type: "float2" }
    { name: "sampler" type: "sampler" sampler: "clamp_point" }
]

common : [[
    float linearize(float depth, float near, float far) {
        return (near*far) / (far - depth*(far - near));
    }
]]

vertex_shader : {
    import_system_semantics : [ "vertex_id" ]
    exports : [
        { name : "uv" type: "float2" }
    ]
    
    code : [[
        static const float4 pos[3] = {
            { -1,  1,  1, 1 },
            {  3,  1,  1, 1 },
            { -1, -3,  1, 1 }
        };
        output.position = pos[input.vertex_id];
        static const float2 uv[3] = {
            { 0,  0 },
            { 4,  0 },
            { 0,  4 }
        };
        output.uv = uv[input.vertex_id];        
        return output;
    ]]
}

pixel_shader : {
    exports : [
        { name : "color" type: "float" }
    ]

    code : [[        
        Texture2D depth_stencil_surface = get_texture();
        float2 near_far = load_near_far();
        
        output.color = linearize(
            depth_stencil_surface.Sample(get_sampler(), input.uv).x, 
            near_far.x, 
            near_far.y);
        
        return output;
    ]]
}

The above example shows a simple shader for converting clip space depth to linear depth by rendering a single triangle covering the entire screen and storing the result in a single channel f32 color target.

It’s an artificial example and most likely not how you would implement something like this in practice, but the above code contains pieces of almost all the parts that build up a full tm_shader_declaration_o, from which we can generate a functional shader.

The first two blocks (depth_stencil_states and raster_states) simply declare a couple of render state blocks. Since I covered state blocks under the section “Low-level shader compilation” in “The Machinery Shader System (part 1)” I won’t cover them again; go check it out if it’s not self-explanatory.

The third block (samplers) declares named sampler state blocks. A sampler state block doesn’t do anything by itself unless it is referenced from an imports-block.

In the imports-block, any resources or constants that should be accessible from the shader code have to be declared. As covered in part 1, an important goal of the shader system is to abstract away the underlying binding model for constants and resources. We want to be able to freely change it without breaking any existing shader code. The system handles this automatically by implicitly injecting helper functions for retrieving each constant or resource into the shader code. Constants get prefixed with load_ and resources with get_. If an array is declared, the element index to retrieve should be passed as an argument to the function.

We will revisit this in a bit more detail in the next section; for now the key takeaway is that from a shader authoring point of view you don’t have to care about how data is bound to the shader.
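To make this a bit more concrete, here is a sketch of what the injected accessors for the imports-block above could look like. This is hypothetical generated code: the resource_indices struct, the global arrays, and the instance_constants name are assumptions made up for illustration, and the real binding layout is an internal detail the system is free to change.

```hlsl
// Hypothetical generated accessors for the imports-block above. The
// resource_indices struct and the global arrays are illustrative only;
// the actual binding layout is internal to the system.
Texture2D get_texture() {
    return textures_2d[resource_indices.texture];
}

SamplerState get_sampler() {
    return all_samplers[resource_indices.sampler];
}

float2 load_near_far() {
    return instance_constants.near_far;
}
```

The point is that shader code only ever calls get_texture() or load_near_far(); whether those resolve to a bindless array lookup, a plain register binding, or something else entirely is up to the system.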

Next up comes a code block named common. For each tm_shader_declaration_o taking part in the compile, the contents of this block are simply concatenated together and inserted just before the code for each shader stage. The common block is mainly used for declaring helper functions and preprocessor macros.

Last but not least comes blocks for each shader stage the user wants to run. In this simple example we only have a vertex and pixel shader active, but all shader stages are of course supported.

Similar to the binding model for resources and constants, all stage-to-stage linking information is also abstracted away and automatically generated. Since we know the execution order of all active shader stages, we can gather the necessary information from the previous stage’s exports block together with any requested system semantics (import_system_semantics and export_system_semantics). From that we can generate input and output structs for each stage. This removes the responsibility of packing data into interpolators from the shader author, and makes it possible to experiment with different packing strategies without breaking any existing shader code. It also means that we end up implicitly generating the function declaration for each shader stage, as we will be injecting some extra code there to handle the unpacking/packing. From the shader author’s perspective, any varying input data becomes accessible through a struct named input, and any varying output data should be written to a struct called output.
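As an illustration, the generated linkage for the vertex stage in the example above might look roughly like this. The struct names, semantic assignments, and packing are all made up for the sketch; in practice they are whatever the system decides to generate.

```hlsl
// Hypothetical generated vertex stage wrapper for the example above.
struct vs_input {
    uint vertex_id : SV_VertexID;   // from import_system_semantics
};

struct vs_output {
    float4 position : SV_Position;  // implicit position output
    float2 uv : TEXCOORD0;          // packed from the exports block
};

vs_output vs_main(vs_input input) {
    vs_output output = (vs_output)0;
    // ...the contents of the vertex_shader code block are inserted here,
    // reading from `input` and writing to `output`...
    return output;
}
```

The pixel shader gets a matching input struct generated from the vertex shader’s exports, which is why interpolator packing never has to be spelled out by hand.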

Also worth mentioning, since it’s not used in the example above, is that within each shader stage it’s also possible to declare an optional block called stage_attributes. This block describes additional attributes that might be needed for certain shader stages, e.g. the thread group size of a compute shader ([numthreads()]), or forcing early depth stencil testing ([earlydepthstencil]).

While the above example shows a complete shader declaration, it’s important to remember that the system encourages code sharing by stacking multiple declarations on top of each other. I briefly mentioned in the beginning that blocks are merged together using special merge rules. For code blocks (i.e., anything within [[ and ]]) we simply concatenate the contents in the order the declarations are combined. For state blocks we merge the contents; if a unique key is declared more than once, the last declaration gets precedence. This makes it possible to easily build up libraries of helper functions and state blocks, which drastically reduces shader code duplication.
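As a hypothetical illustration of these merge rules, stacking a library declaration under a shader declaration could play out like this (the cull_mode key and the helper functions are made up for the example):

```
// Library declaration:
raster_states : {
    polygon_mode: "fill"
    cull_mode: "back"
}
common : [[
    float square(float x) { return x * x; }
]]

// Shader declaration stacked on top of it:
raster_states : {
    cull_mode: "none"
}
common : [[
    float cube(float x) { return x * square(x); }
]]

// Resulting merged declaration -- code blocks are concatenated in
// stacking order, while for state blocks the last declared key wins:
raster_states : {
    polygon_mode: "fill"
    cull_mode: "none"
}
common : [[
    float square(float x) { return x * x; }
    float cube(float x) { return x * square(x); }
]]
```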

Binding of Constants and Resources

So from a shader author’s point of view we’ve removed the need to understand how data in the form of resources and constants gets into the shader code. The only thing they have to worry about is to declare whatever constants or resources they need; these then automatically become accessible through the generated load_*() and get_*() functions. But how does this work on the inside?

This is something I’ve already been iterating over quite a bit since I started working on this system.

For resources, my original idea for Vulkan was to attempt a completely “bindless” approach: basically creating a single “resource binder” (which maps to a Descriptor Set in Vulkan, see my post on “Efficient binding of shader resources”) per shader type, and then for each shader instance appending any resources (textures, buffers, samplers) to unbounded arrays, one for each resource type, within the resource binder. I had some previous experience with using large arrays of textures without too much trouble, so I naively started running in this direction, coding away like a happy child for a couple of days, until I realized that Vulkan doesn’t support indexing unbounded arrays of buffers (i.e., uniform buffers and storage buffers), only textures. Sigh.

So for now I’m back at having one resource binder per shader instance. A bit annoying but not too horrible as the abstraction layer allows me to easily iterate over this later without affecting any existing shader code or any user code built on top of the shader system.

For constants I simply stuff the constants for each shader instance into a single buffer and use Vulkan’s push constants to index into the buffer to find the data for the right shader instance.
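A rough sketch of what that could translate to on the shader side, assuming DXC’s Vulkan extension syntax for push constants. The buffer type, offset layout, and names here are illustrative assumptions, not the actual implementation:

```hlsl
// Hypothetical generated code: the push constant carries the byte offset
// of this shader instance's constant block within one big shared buffer.
[[vk::push_constant]]
cbuffer tm_push_constants {
    uint constant_offset;
};

ByteAddressBuffer constant_buffer;

float2 load_near_far() {
    // Assumes near_far sits at relative offset 0 within this
    // instance's constant block.
    return asfloat(constant_buffer.Load2(constant_offset));
}
```

Since only the small offset changes per instance, many instances can share the same buffer binding and the push constant stays cheap to update between draws.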

That’s about it. Some care has to be taken to support a lock-free API for updating resources and constants, but it’s nothing too complicated so I won’t cover it in this post.

Do note though that all of this is a bit in flux, and it will likely continue to be. I anticipate that the actual implementation will keep changing over time as graphics APIs evolve and as we add support for more platforms. The important architectural takeaway is that the current APIs, both from a shader author’s perspective and from the perspective of a user of the shader system, feel pretty sane and stable and can probably be locked down soon.

Wrap up

In my next post about the shader system we’ll continue to dive into more reasons why I think it’s crucial to use an abstract binding model instead of declaring resources and constants directly in the shader code. We will then enter the land of multi-pass shaders, context-driven shader variations, and constants and resources with different update frequencies, so if you haven’t been convinced yet, I still think I might be able to do so then. Stay tuned.

(But who knows, I might just as well end up writing about something completely different again, time will tell.)