UI rendering using Primitive Buffers

Over the years we have used a lot of different frameworks for creating user interfaces for various editors: wxWidgets, WinForms, WPF, Qt, Chromium/HTML5 and others. To be honest, we’ve never really liked any of them. Most of them feel bloated and they usually drag in huge dependencies. And when they don’t behave or perform as expected, it can be very hard to navigate, debug and understand these monster frameworks.

From time to time we’ve pitched the idea of building our own UI system instead of relying on some existing huge framework, but since pretty much all of the tools programmers we’ve ever worked with have recommended against it, we haven’t been bold enough to go for it. Recently, however, we have started seeing more lightweight libraries for doing immediate mode GUIs popping up (Dear ImGui and Nuklear are two excellent examples). These have inspired us and yet again sparked our interest in building our own system.

So for The Machinery we’ve decided to ignore all recommendations and run full steam ahead with rolling our own UI system for the editor. This might be a decision that we will regret in the future, but then so be it! At least we’ll be in a position where we understand exactly what we regret and why. Another reason for this decision is that we have started seeing a blurring of the lines between editors and runtime, and we expect this trend to continue as more editing workflows move into VR/AR.

If we back out and take a bird’s-eye view of the editor we are aiming to build it can be broken down into only a handful of components / sub-editors:

  • Window management and a docking system
  • Widget library with standard buttons, check boxes, color picker, etc.
  • A tree view
  • A property editor
  • A graph editor

And that’s about it really. Sure, there will also be a bunch of special dialog boxes, diagnostics windows, etc. But if we do a good job with the above core components and make it easy to extend them, adding more widgets to the widget library, etc., we should hopefully be in a decent place.

Also note that since The Machinery is heavily data-driven, there is not much to gain from having some kind of visual editor for doing the UI design. As most of the content in the panels won’t be known until we run the editor, we are forced to dynamically generate the layouts of our UIs anyway.

Exactly how we’ll go about building the high-level interface for creating UIs (immediate vs retained mode, or perhaps some kind of mix?) hasn’t been completely ironed out yet, and is outside the scope of this blog post. What we do know though is that the system will be built around draw functions that generate data into buffers that the GPU will render for us. Today we will take a look at what goes into those buffers. We call them “Primitive Buffers”.

Primitive Buffers

One of the core design philosophies at Our Machinery is to always think about how to make data flow through the various systems as efficiently as possible. While we today might be a bit less afraid of wasting some memory for programmer / developer convenience (compared to what we had to be when targeting last generation consoles), we try to be cautious not to move big chunks of data around more than absolutely necessary.

If you think about immediate mode GUI systems, they have a tendency to generate quite a lot of index and vertex data. On a PC this data needs to be transferred over the PCI-E bus before the GPU can execute some (usually) very simple vertex and pixel shaders. Another performance culprit to be aware of is the large number of draw calls and state switches a naive implementation might end up generating. This might not be a big deal if the UI is simple or if it is the only thing the application has to render. But as we intend to spend most of our rendering time on things other than UI, and expect to have rather complex user interfaces that have to feel snappy, we’ve started playing with some alternative ideas.

One of the core modeling blocks of any UI is the rectangle. A complex user interface can easily generate tens of thousands of rectangles. Typically each rectangle would generate 4 vertices into a vertex buffer as well as 6 indices into an index buffer.

A simple vertex format for an untextured rectangle might look something like this:

struct RectVertex {
    float x, y;      // position
    uint8_t col[4];  // color, 8 bits per channel
};

sizeof(RectVertex) == 12 bytes. 4 vertices * 12 bytes = 48 bytes of vertex data, plus 6 indices * 4 bytes = 24 bytes of index data, for a total of 72 bytes of data per rectangle.

Now, assuming you are running on a decently modern graphics API and on hardware that supports doing “raw” buffer loads in the vertex shader (using D3D’s ByteAddressBuffers or equivalent) it is easy to significantly reduce the memory footprint of this simple rectangle.

The idea is simple: just stop relying on the Input Assembler to fetch the vertex data for you. Instead, construct the vertices in the vertex shader by manually loading primitive information from a Primitive Buffer, guided by data encoded in the index buffer.

Let’s take the rectangle example from above and express it this way instead:

struct Rect {
  float x, y, w, h;  // position and size
  uint8_t col[4];    // color, 8 bits per channel
};

This is what we would write to the primitive buffer. In the index buffer we still would write 6 indices but as we no longer depend on the Input Assembler to fetch the vertex data for us we don’t have to store regular vertex indices. Instead we can freely encode whatever we want into the index buffer as long as we know how to interpret that data in the vertex shader.
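As a rough illustration of what the vertex shader has to do with such a Rect, here is the corner expansion emulated in plain C. This is a sketch under assumptions, not the actual shader: the function name rect_corner_position, the Vec2 type, the use of byte offsets, and the convention that bit 0 of the corner id selects the right edge and bit 1 the bottom edge are all made up for the example.

```c
#include <stdint.h>
#include <string.h>

struct Rect {
    float x, y, w, h;
    uint8_t col[4];
};

struct Vec2 {
    float x, y;
};

/* Emulates a "raw" buffer load followed by vertex construction.
 * prim_buffer: the primitive buffer as raw bytes.
 * byte_offset: where the Rect starts (assumed to be a byte offset).
 * corner:      0..3; bit 0 = right edge, bit 1 = bottom edge (assumed). */
struct Vec2 rect_corner_position(const uint8_t *prim_buffer,
                                 uint32_t byte_offset,
                                 uint32_t corner)
{
    struct Rect r;
    struct Vec2 p;

    /* memcpy avoids alignment problems when reading from a byte buffer. */
    memcpy(&r, prim_buffer + byte_offset, sizeof r);

    p.x = r.x + ((corner & 1u) ? r.w : 0.0f);
    p.y = r.y + ((corner & 2u) ? r.h : 0.0f);
    return p;
}
```

With a Rect of {10, 20, 30, 40}, corner 0 maps to (10, 20) and corner 3 to (40, 60). All four vertices are derived from the same 20 bytes of primitive data.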

In the current implementation of our new UI system we use the lower 24 bits of the vertex id to store the offset into the primitive data (i.e. the Rect in this example), and the upper 8 bits as auxiliary data that guides how to interpret and generate vertices from the primitive. At the moment we use the upper 6 bits of the aux-data to encode primitive type information and the lower 2 bits to encode which rectangle corner the current vertex represents.

00000000 00000000 00000000 00000000
|     |  |
|     |  |- offset into primitive buffer
|     |- rect corner id
|- primitive type
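The bit layout above can be packed and unpacked with a handful of shifts and masks. The following is a minimal sketch in C. The function names are hypothetical, and whether the offset counts bytes or some larger unit is left open here.

```c
#include <assert.h>
#include <stdint.h>

#define VERTEX_ID_OFFSET_BITS 24u
#define VERTEX_ID_OFFSET_MASK ((1u << VERTEX_ID_OFFSET_BITS) - 1u)

/* Packs primitive type (6 bits), rect corner id (2 bits) and
 * primitive buffer offset (24 bits) into one 32-bit index. */
uint32_t encode_vertex_id(uint32_t type, uint32_t corner, uint32_t offset)
{
    assert(type < 64u);
    assert(corner < 4u);
    assert(offset <= VERTEX_ID_OFFSET_MASK);
    return (type << 26) | (corner << 24) | offset;
}

/* Upper 6 bits: primitive type. */
uint32_t vertex_id_type(uint32_t id)
{
    return id >> 26;
}

/* Next 2 bits: which rectangle corner this vertex represents. */
uint32_t vertex_id_corner(uint32_t id)
{
    return (id >> 24) & 3u;
}

/* Lower 24 bits: offset into the primitive buffer. */
uint32_t vertex_id_offset(uint32_t id)
{
    return id & VERTEX_ID_OFFSET_MASK;
}
```

For example, type 1, corner 3 and offset 20 encode to 0x07000014. With 24 bits of offset a single draw can address up to 2^24 distinct offsets into the primitive buffer.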

Using this representation, the size of a rectangle goes from 72 bytes down to 44 bytes (sizeof(Rect) == 20 bytes, plus 6 indices * 4 bytes = 24 bytes), which is roughly 40% less than the original approach.
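The arithmetic can be double-checked with a few lines of C. This is only a sketch: the helper names are made up for illustration, and the sizes assume a typical platform with 4-byte floats and 32-bit indices.

```c
#include <stddef.h>
#include <stdint.h>

struct RectVertex {
    float x, y;
    uint8_t col[4];
};

struct Rect {
    float x, y, w, h;
    uint8_t col[4];
};

enum { VERTICES_PER_RECT = 4, INDICES_PER_RECT = 6 };

/* Vertex buffer approach: four full vertices plus six indices. */
size_t vertex_buffer_bytes_per_rect(void)
{
    return VERTICES_PER_RECT * sizeof(struct RectVertex)
         + INDICES_PER_RECT * sizeof(uint32_t);
}

/* Primitive buffer approach: one Rect plus six encoded indices. */
size_t primitive_buffer_bytes_per_rect(void)
{
    return sizeof(struct Rect) + INDICES_PER_RECT * sizeof(uint32_t);
}
```

On such a platform the two functions return 72 and 44 respectively, a saving of 28 bytes per rectangle.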

You could argue that the win is less than 40%, as you might be able to get away with 16-bit indices when using the vertex buffer approach (depending on the complexity of your UI). But keep in mind that this is a very simple example and the real gains don’t show up until you start mixing in more complex primitive types. Our goal is to render the entire UI with as few draw calls as possible (ideally just one), and if you are using the vertex buffer approach there is no way around padding to the worst case scenario stride, which potentially might become a rather fat vertex (adding UV coordinates, clipping and any other stuff the other primitive types might need). And then for each of these simple rectangles you are stuck with having to amplify this rather fat vertex by four! Using the primitive buffer approach this problem completely goes away, as we can tightly pack primitive data of varying sizes without any padding.

Another nice thing with using primitive buffers is that the primitive representations encoded in the buffer often are close to or even identical to the representations exposed in the interface of the UI system. This eliminates the cost you have with the traditional vertex buffer approach when translating from e.g. a rectangle representation in the UI into four vertices. With primitive buffers you typically get away with a memcpy and occasionally some patch up.
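To make the “memcpy and some patch up” point concrete, here is a sketch of what a rectangle draw function could look like on the CPU side. Everything named here (DrawBuffers, draw_rect, PRIM_TYPE_RECT, the fixed capacities, and the corner order used for the two triangles) is an assumption made for the example, not the actual interface of the UI system.

```c
#include <stdint.h>
#include <string.h>

struct Rect {
    float x, y, w, h;
    uint8_t col[4];
};

enum {
    PRIM_BUFFER_CAPACITY = 4096, /* bytes, arbitrary for this sketch */
    INDEX_CAPACITY = 1024,       /* indices, arbitrary for this sketch */
    PRIM_TYPE_RECT = 1           /* hypothetical primitive type id */
};

struct DrawBuffers {
    uint8_t prims[PRIM_BUFFER_CAPACITY];
    uint32_t prim_bytes;
    uint32_t indices[INDEX_CAPACITY];
    uint32_t index_count;
};

/* Appends one rectangle. Returns 1 on success, 0 if a buffer is full. */
int draw_rect(struct DrawBuffers *b, const struct Rect *r)
{
    /* Two triangles; the corner order is an assumption. */
    static const uint32_t corners[6] = {0, 1, 2, 2, 1, 3};
    const uint32_t size = (uint32_t)sizeof *r;
    uint32_t offset, i;

    if (b->prim_bytes + size > PRIM_BUFFER_CAPACITY ||
        b->index_count + 6u > INDEX_CAPACITY)
        return 0;

    /* The UI-side Rect is already the GPU-side primitive: just copy it. */
    offset = b->prim_bytes;
    memcpy(b->prims + offset, r, size);
    b->prim_bytes += size;

    /* Six indices: type in the top 6 bits, corner in the next 2,
     * byte offset of the primitive in the low 24. */
    for (i = 0; i < 6u; ++i)
        b->indices[b->index_count++] =
            ((uint32_t)PRIM_TYPE_RECT << 26) | (corners[i] << 24) | offset;

    return 1;
}
```

Note that no per-vertex data is generated at all. The only per-vertex cost on the CPU is writing one 32-bit index.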


Clipping

For editor UI we don’t expect to need anything more complex than simple rectangular clipping, and we don’t think we’ll need to clip against more than one rectangle at a time. So we’ve tried to design a simple and efficient clipping system based on that. Here’s how it works:

To activate a clip rectangle you first have to register it with the UI system. When doing so, the clip rectangle gets written to the primitive buffer and its offset in the buffer is returned. Later, when you are drawing primitives and want to clip them against a previously registered rectangle, you simply provide the offset of the clip rectangle to the draw functions. To simplify the engine code we only cull the primitive against the active clip rectangle on the CPU. If we then determine that the primitive is in need of clipping, the actual clipping responsibility is pushed to the GPU.
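The registration and CPU-side culling steps could be sketched as follows. The names (PrimBuffer, register_clip_rect, classify_against_clip) are hypothetical. Reserving the first four bytes of the buffer so that no clip rectangle ever lands at offset 0, and skipping GPU clipping for primitives that are fully inside the clip rectangle, are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

struct ClipRect {
    float x, y, w, h;
};

enum { PRIM_CAPACITY = 4096 }; /* bytes, arbitrary for this sketch */

struct PrimBuffer {
    uint8_t data[PRIM_CAPACITY];
    uint32_t size;
};

/* Reserve offset 0 so a returned offset of 0 can never be a real rect. */
void prim_buffer_init(struct PrimBuffer *pb)
{
    memset(pb->data, 0, 4);
    pb->size = 4;
}

/* Writes the clip rect into the primitive buffer and returns its byte
 * offset, or 0 if the buffer is full. */
uint32_t register_clip_rect(struct PrimBuffer *pb, const struct ClipRect *c)
{
    const uint32_t size = (uint32_t)sizeof *c;
    uint32_t offset;

    if (pb->size + size > PRIM_CAPACITY)
        return 0;
    offset = pb->size;
    memcpy(pb->data + offset, c, size);
    pb->size += size;
    return offset;
}

enum ClipResult {
    CLIP_CULLED,  /* fully outside: do not emit the primitive */
    CLIP_INSIDE,  /* fully inside: no clipping needed */
    CLIP_PARTIAL  /* straddles the edge: let the GPU clip it */
};

/* CPU-side test of a primitive's bounds against the clip rect. */
enum ClipResult classify_against_clip(float x, float y, float w, float h,
                                      const struct ClipRect *c)
{
    if (x >= c->x + c->w || x + w <= c->x ||
        y >= c->y + c->h || y + h <= c->y)
        return CLIP_CULLED;
    if (x >= c->x && x + w <= c->x + c->w &&
        y >= c->y && y + h <= c->y + c->h)
        return CLIP_INSIDE;
    return CLIP_PARTIAL;
}
```

A draw function would then skip CLIP_CULLED primitives entirely and only pass the clip offset along for CLIP_PARTIAL ones. In this sketch a full buffer also returns 0, so a real system would have to treat that case separately from “clipping disabled”.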

This works by storing a uint32_t containing the offset of the clip rectangle (or 0 if clipping is disabled) inside each primitive representation encoded in the primitive buffer. In the vertex shader we can then decide, based on what type of primitive we are currently rendering, how best to clip it. (A simple brute-force implementation would be to discard any pixels outside the clip rect in the pixel shader, but for simple primitive types like rectangles you might be better off clamping the vertex and UV coordinates.)
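For an untextured rectangle, the clamping variant could look roughly like this when emulated in C. The ClippedRect layout, the field name clip, the byte-offset convention and the corner bit convention (bit 0 = right edge, bit 1 = bottom edge) are assumptions for the sake of the example. A textured primitive would need its UV coordinates adjusted in the same way.

```c
#include <stdint.h>
#include <string.h>

struct ClipRect {
    float x, y, w, h;
};

/* A rectangle primitive carrying the offset of its clip rect. */
struct ClippedRect {
    float x, y, w, h;
    uint8_t col[4];
    uint32_t clip; /* byte offset of a ClipRect, or 0 for no clipping */
};

struct Vec2 {
    float x, y;
};

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Emulates the vertex shader: builds one corner of the rectangle and
 * clamps it to the clip rect if the primitive references one. */
struct Vec2 clipped_corner(const uint8_t *prims, uint32_t rect_offset,
                           uint32_t corner)
{
    struct ClippedRect r;
    struct Vec2 p;

    memcpy(&r, prims + rect_offset, sizeof r);
    p.x = r.x + ((corner & 1u) ? r.w : 0.0f);
    p.y = r.y + ((corner & 2u) ? r.h : 0.0f);

    if (r.clip != 0u) {
        struct ClipRect c;
        memcpy(&c, prims + r.clip, sizeof c);
        p.x = clampf(p.x, c.x, c.x + c.w);
        p.y = clampf(p.y, c.y, c.y + c.h);
    }
    return p;
}
```

Clamping every corner shrinks the rectangle to its intersection with the clip rectangle without touching the pixel shader. A rectangle that lies entirely outside collapses to zero area, which is why culling those on the CPU first is the cheaper option.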

Wrapping up

Overall we feel pretty satisfied with how this is shaping up. The primitive buffer idea reflects one of the core design ideas for The Machinery — i.e. that most systems can be designed around how data is laid out in memory and how it flows to and from other systems. If done correctly it becomes trivial to (almost transparently) shift and scale the actual computational responsibility over multiple processors, GPUs, or even remote machines.