Overview
The RenderDevice is essentially our abstraction layer for platform-specific rendering APIs. It is implemented as an abstract base class that the various rendering back-ends (D3D11, D3D12, OGL, Metal, GNM, etc.) implement.
The RenderDevice has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of these are fairly straightforward, so I won't cover them in this post; instead I will focus on the two dispatch functions consuming RenderResourceContexts and RenderContexts:
class RenderDevice {
public:
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc,
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;

    virtual void dispatch(uint32_t n_contexts, RenderContext **rc,
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};
Resource Management
As covered in the post about RenderResourceContexts, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called RenderDevice::dispatch(), handing over the RenderResourceContexts, that their representations get created on the RenderDevice side.
All implementations of a RenderDevice have some form of resource management that deals with creating, updating and destroying the graphics-API-specific representations of resources. Typically we track the state of all the various types of resources in a single struct; here's a stripped-down example from the DX12 RenderDevice implementation, called D3D12ResourceContext:
struct D3D12VertexBuffer
{
    D3D12_VERTEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12IndexBuffer
{
    D3D12_INDEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12ResourceContext
{
    Array<D3D12VertexBuffer> vertex_buffers;
    Array<uint32_t> unused_vertex_buffers;

    Array<D3D12IndexBuffer> index_buffers;
    Array<uint32_t> unused_index_buffers;

    // .. lots of other resources

    Array<uint32_t> resource_lut;
};
As you might remember, the linking between the engine representation and the RenderDevice representation is done using RenderResource::render_resource_handle. It encodes both the type of the resource and a handle. The resource_lut is an indirection used to go from the engine handle to a local index for a specific type (e.g. vertex_buffers or index_buffers in the sample above). We also track freed indices for each type (e.g. unused_vertex_buffers) to simplify recycling of slots.
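As a rough sketch of how this indirection and slot recycling might fit together (the bit layout of the handle and all names below are my own assumptions for illustration, not the actual Stingray encoding):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical handle encoding: top 8 bits = resource type, low 24 bits = engine index.
enum ResourceType : uint32_t { VERTEX_BUFFER = 1, INDEX_BUFFER = 2 };

inline uint32_t make_handle(ResourceType type, uint32_t engine_index) {
    return (type << 24) | (engine_index & 0xffffffu);
}
inline ResourceType handle_type(uint32_t handle) { return ResourceType(handle >> 24); }
inline uint32_t handle_index(uint32_t handle) { return handle & 0xffffffu; }

struct ResourceContext {
    std::vector<int> vertex_buffers;             // stand-in for Array<D3D12VertexBuffer>
    std::vector<uint32_t> unused_vertex_buffers; // free-list of recycled local slots
    std::vector<uint32_t> resource_lut;          // engine index -> local index

    uint32_t allocate_vertex_buffer(uint32_t handle, int data) {
        uint32_t local;
        if (!unused_vertex_buffers.empty()) {    // recycle a freed slot if possible
            local = unused_vertex_buffers.back();
            unused_vertex_buffers.pop_back();
            vertex_buffers[local] = data;
        } else {
            local = (uint32_t)vertex_buffers.size();
            vertex_buffers.push_back(data);
        }
        uint32_t engine = handle_index(handle);
        if (engine >= resource_lut.size())
            resource_lut.resize(engine + 1, ~0u);
        resource_lut[engine] = local;
        return local;
    }

    int lookup_vertex_buffer(uint32_t handle) const {
        return vertex_buffers[resource_lut[handle_index(handle)]];
    }

    void free_vertex_buffer(uint32_t handle) {
        uint32_t engine = handle_index(handle);
        unused_vertex_buffers.push_back(resource_lut[engine]);
        resource_lut[engine] = ~0u;
    }
};
```

The point of the free-list is that local arrays stay dense and slots get reused, while the engine-side handle stays stable for the lifetime of the resource.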
The implementation of the dispatch function is fairly straightforward. We simply iterate over all the RenderResourceContexts and, for each context, iterate over its commands and either allocate or deallocate resources in the D3D12ResourceContext. It is important to note that this is a synchronous operation; nothing else is peeking or poking at the D3D12ResourceContext while the dispatch of RenderResourceContexts is happening, which makes our life a lot easier.
Unfortunately that isn't the case when we dispatch RenderContexts, as there we want to go wide (i.e. fork the workload and process it on multiple worker threads) when translating the commands into API calls. While we don't allow allocating and deallocating resources from RenderContexts, we do allow updating them, which mutates the state of the RenderDevice representations (e.g. a D3D12VertexBuffer).
At the moment our solution for this isn't very nice: basically, we don't allow asynchronous updates for anything other than DYNAMIC buffers. UPDATABLE buffers are always updated serially, before we kick off the worker threads, no matter what their sort_key is. All worker threads access resources through their own copy of something we call a ResourceAccessor, which is responsible for tracking the worker thread's state of dynamic buffers (among other things). In the future I think we should probably generalize this and treat UPDATABLE buffers in a similar way.
(Note: this limitation doesn't mean you can't update an UPDATABLE buffer more than once per frame; it simply means you cannot update it more than once per dispatch.)
Shaders
Resources in the D3D12ResourceContext are typically buffers. One exception that stands out is the RenderDevice representation of a "shader". A "shader" on the RenderDevice side maps to a ShaderTemplate::Context on the engine side, or what I guess we could call a multi-pass shader. Here's some pseudo code:
struct ShaderPass
{
    struct ShaderProgram
    {
        Array<uint8_t> bytecode;
        struct ConstantBufferBindInfo;
        struct ResourceBindInfo;
        struct SamplerBindInfo;
    };

    ShaderProgram vertex_shader;
    ShaderProgram domain_shader;
    ShaderProgram hull_shader;
    ShaderProgram geometry_shader;
    ShaderProgram pixel_shader;
    ShaderProgram compute_shader;

    struct RenderStates;
};

struct Shader
{
    Vector<ShaderPass> passes;
    enum SortMode { IMMEDIATE, DEFERRED };
    uint32_t sort_mode;
};
The pseudo code above is essentially the RenderDevice representation of a shader that we serialize to disk during data compilation. From it we can create all the necessary graphics-API-specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).
As discussed in the last post, the sort_key encodes the shader pass index. Using Shader::sort_mode, we know which bit range to extract from the sort_key as the pass index, which we then use to look up the ShaderPass from Shader::passes. A ShaderPass contains one ShaderProgram per active shader stage, and each ShaderProgram contains the byte code for the shader to compile as well as "bind info" for the various resources that the shader wants as input.
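The pass-index extraction can be sketched like this (the bit positions and widths below are invented for illustration; the real layout depends on the engine's sort_key encoding):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sort_key layouts: the pass index lives in a different bit range
// depending on the shader's sort mode. The exact positions are made up here.
enum SortMode { IMMEDIATE, DEFERRED };

struct PassBits { uint32_t shift; uint64_t mask; };

inline PassBits pass_bits(SortMode mode) {
    // Assume IMMEDIATE shaders keep the pass index high in the key, DEFERRED
    // shaders keep it lower, below other sorting criteria.
    return mode == IMMEDIATE ? PassBits{56, 0xf} : PassBits{32, 0xf};
}

inline uint32_t extract_pass_index(uint64_t sort_key, SortMode mode) {
    PassBits b = pass_bits(mode);
    return uint32_t((sort_key >> b.shift) & b.mask);
}
```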
We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.
Render Context translation
Let's move on and look at the dispatch for translating RenderContexts into graphics API calls:
class RenderDevice {
public:
    virtual void dispatch(uint32_t n_contexts, RenderContext **rc,
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};
The first thing all RenderDevice implementations do when receiving a bunch of RenderContexts is to merge and sort their Commands. All implementations share the same code for doing this:
void prepare_command_list(RenderContext::Commands &output, unsigned n_contexts, RenderContext **contexts);
This function basically just takes the RenderContext::Commands from all the RenderContexts, merges them into a new array, runs a stable radix sort, and returns the sorted commands in output. To avoid memory allocations the RenderDevice implementation owns the memory of the output buffer.
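In spirit, prepare_command_list does something like the following (using std::stable_sort as a stand-in for the engine's stable radix sort, and a deliberately simplified Command):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Command {
    uint64_t sort_key;
    uint32_t payload; // stand-in for Command::head etc.
};

struct RenderContext {
    std::vector<Command> commands;
};

// Merge the commands of all contexts into `output` and stable-sort on sort_key.
// The real implementation reuses an output buffer owned by the RenderDevice and
// runs a stable radix sort instead of a comparison sort.
void prepare_command_list(std::vector<Command> &output, unsigned n_contexts,
                          RenderContext **contexts) {
    output.clear();
    for (unsigned i = 0; i != n_contexts; ++i)
        output.insert(output.end(), contexts[i]->commands.begin(),
                      contexts[i]->commands.end());
    std::stable_sort(output.begin(), output.end(),
                     [](const Command &a, const Command &b) {
                         return a.sort_key < b.sort_key;
                     });
}
```

Stability matters here: commands with equal sort_keys must keep their submission order, otherwise rendering that relies on submission order for tie-breaking would become non-deterministic.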
Now we have all the commands nicely sorted based on their sort_key. The next step is to translate the data referenced by the commands into graphics API calls. I will explain this process under the assumption that we are running on a graphics API that allows us to build command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.
Before we start figuring out the per-thread workloads for going wide, we have one more thing to do: "instance merging".
Instance Merging
I've mentioned the idea behind instance merging before [1,2]. Basically, we want to reduce the number of RenderJobPackages (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray "similar enough" basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages; the only thing allowed to differ is constant buffer variables. (Note: by today's standards this can be considered a bit old school; newer graphics APIs and hardware allow us to tackle this problem more aggressively using "bindless" concepts.)
The way it works is by filtering out ranges of RenderContext::Commands where the "instance bit" of the sort_key is set and all bits above the instance bit are identical. Then, for each of those ranges, we fork and go wide to analyze the actual RenderJobPackage data to see if the instance_hash and the shader are the same; if so, we know it's safe to merge them.
The actual merge is done by extracting the instance-specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic RawBuffer that gets bound as input to the vertex shader.
Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger itself is not graphics API specific and is isolated in its own system; it just happens to be the responsibility of the RenderDevice to call it. The interface looks like this:
namespace instance_merger {

struct ProcessMergedCommandsResult
{
    uint32_t n_instances;
    uint32_t instanced_batches;
    uint32_t instance_buffer_size;
};

ProcessMergedCommandsResult process_merged_commands(Merger &instance_merger,
    RenderContext::Commands &merged_commands);

}
Pass in a reference to the sorted RenderContext::Commands in merged_commands, and after the instance merger is done running you will hopefully have fewer commands in the array. :)
You could argue that merging, sorting and instance merging should all happen before we enter the world of the RenderDevice. I wouldn't argue against that.
Prepare workloads
The last step before we can start translating our commands into state/draw/dispatch calls is to split the workload into reasonably sized chunks and prepare the execution contexts for our worker threads.
Typically we just divide the number of RenderContext::Commands we have to process by the number of worker threads available. We don't care about the types of the commands we will be processing and don't try to load balance based on them. The reasoning is that we anticipate draw calls will always represent the bulk of the commands, and the rest can be considered unavoidable "noise". We do, however, make sure that we never do fewer than x commands per worker thread, where x can differ a bit depending on platform but is usually ~128.
For each execution context we create a ResourceAccessor (described above), as well as make sure we have the correct state set up in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous serial sweep over all the commands to find the bigger state-changing commands (such as RenderContext::set_render_target).
This is where the Command::command_flags bit-flag comes into play: instead of having to jump around in memory to figure out what type of command Command::head points to, we put some hinting about the type in Command::command_flags, for example whether it is a "state command". This way the serial sweep doesn't become very costly even when dealing with a large number of commands. During this sweep we also deal with updating UPDATABLE resources, and on newer graphics APIs we track fences (discussed in the post about Render Contexts).
The last thing we do is set up the execution contexts with graphics-API-specific representations of command lists (e.g. ID3D12GraphicsCommandList in DX12).
Translation
Once we get to this point, doing the actual translation is fairly straightforward. Within each worker thread we simply loop over its dedicated range of commands, fetch the data from Command::head, and generate any number of API-specific commands based on the type of command.
For a RenderJobPackage representing a draw call it involves:
- Look up the correct shader pass and, unless already bound, bind all active shader stages
- Look up the state blocks (Rasterizer, Depth Stencil, Blending, etc.) from the shader and bind them unless already bound
- Look up and bind the resources for each shader stage using the RenderResource::render_resource_handle translated through the D3D12ResourceAccessor
- Set up the input assembler by looping over the RenderResource::render_resource_handles pointed to by RenderJobPackage::resource_offset, translated through the D3D12ResourceAccessor
- Bind and potentially update constant buffers
- Issue the draw call
The execution contexts also hold most-recently-used caches to avoid unnecessary binds of resources/shaders/states, etc.
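Such a cache can be as simple as remembering the last bound object per slot and skipping the API call on a match; a minimal sketch (names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Minimal most-recently-used bind cache: skip the API call when the same
// shader is already bound on this execution context.
struct BindCache {
    uint32_t bound_shader = ~0u; // nothing bound yet
    uint32_t n_api_binds = 0;    // counts actual API calls issued

    void bind_shader(uint32_t shader) {
        if (shader == bound_shader)
            return;        // already bound -- no API call needed
        bound_shader = shader;
        ++n_api_binds;     // here the real code would call the graphics API
    }
};
```

Because commands arrive sorted on sort_key, consecutive draws tend to share shaders and states, which is what makes this simple caching effective.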
Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve this by doing at least part of the tracking when building the RenderContexts, but haven't gotten around to looking into it yet.
Execute
When the translation is done we pass the resulting command lists to the correct queues for execution.
Note: In DX12 this is a bit more complicated, as we have to interleave signaling/waiting on fences between command list executions (ExecuteCommandLists).
Next up
I've deliberately not dived into too much detail in this post, to make it a bit easier to digest. I think I've managed to cover the overall design of a RenderDevice though, enough to make it easier for people diving into the code for the first time.
With this post we've reached the half-way point of this series; we have now covered the "low-level" aspects of the Stingray rendering architecture. In the next post we will start looking at more high-level stuff, starting with the RenderInterface, which is the main interface for other threads to talk to the renderer.