After upgrading my particle system, the next part that needs my attention is my deferred renderer.
It has all types of lights (using compute shaders when possible, pixel shaders if the light also has shadows), and all the usual suspects (HBAO, DoF, Bokeh, Tonemap...).
I have now started to upgrade my shaders to the next level:
- Msaa support with subpixel (sample shading)
- Allow better AO usage (AO Factor can be set as low quality or each light can have an AO factor)
- Better organization of some calculations (to avoid doing them twice).
- Some other cool things ;)
Before I start revamping the glue code to handle all this, there is one thing to deal with: as soon as you start using MSAA targets (this is no news of course), your memory footprint grows quite drastically.
In my use case, since I'm not dealing with the usual "single level game scenario", I can also have several full HD (or larger, or smaller) scenes which all need to be rendered every frame and composited.
I looked a bit at my memory usage, and while it's not too bad (reasonably efficient usage of pools and temporary resources), I thought I should have a proper think about it before starting to code ;)
So generally when we render scenes, we have several types of resource lifetimes:
- "Scene lifetime" : Those are resources which live with your scene, so until you decide to unload your whole scene, those resources must live. A good example is particle buffers: as they are read/write, they need to be persisted across frames.
- "Frame lifetime" : Those are the ones that we use for a single frame, often intermediate results that need to persist for a sufficiently long part of the frame. For example, Linear Depth is quite often required for a long stretch of your post processing pipeline, since it's used by a decent number of post processors.
- "Scoped lifetime" : Those have a very short lifetime (generally within a unit of work/function call)
When I did a first memory profile test, I could see that actually a lot of my memory footprint is caused by those Scoped resources, so I decided to first focus on those.
So as a starter, here are my local resources for some of my post processors
- Depth of field: 1 target for CoC (R16f), 1 target for temp blur (4 channels, format is renderer dependent)
- Hbao : 4 targets for AO + blur (2 are R16f, other 2 are R16G16f)
- Bokeh : 1 buffer for sprite filtering, 1 target for overdraw (renderer dependent).
Now for those short lived resources, you can handle them in the following ways.
- Create/Dispose : you create resources every time they are needed and release them before leaving the function. In C# code, this would look like the traditional using pattern:
Code Snippet
- using (var buffer = DX11StructuredBuffer.CreateAppend<float>(device, 1024))
- {
- //Do some work
- } //Buffer is now disposed
While this is a natural pattern in C#, it is not designed to work well with real time graphics (resource creation is expensive, and creating / releasing GPU resources all the time is not such a good idea, with memory fragmentation looming).
- Resource Pool : instead of creating resources all the time, we create a small wrapper around each resource to keep an isLocked flag. It looks this way in C#:
Code Snippet
- var buffer = Device.ResourcePool.LockStructuredBuffer<float>(1024);
- //do something with buffer : buffer.Element.ShaderView;
- buffer.UnLock(); // Mark as free to reuse
When we request a resource, the pool checks whether a buffer with the required flags/stride is available. If so, it marks it as "In Use" and returns it. If no buffer matching the specification is found, it creates and returns a new one.
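To make the lock/unlock behaviour concrete, here is a minimal sketch of such a pool. It is not my actual pool: `PooledBuffer` and `SimpleBufferPool` are hypothetical names, and the "buffer" is a plain object that only carries the properties the pool matches on (no GPU resource is created).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a GPU buffer: only the properties the pool matches on.
public sealed class PooledBuffer
{
    public int ElementCount { get; }
    public int Stride { get; }
    public bool IsLocked { get; private set; }

    public PooledBuffer(int elementCount, int stride)
    {
        ElementCount = elementCount;
        Stride = stride;
    }

    internal void Lock() { IsLocked = true; }

    // Mark as free to reuse.
    public void UnLock() { IsLocked = false; }
}

public sealed class SimpleBufferPool
{
    private readonly List<PooledBuffer> buffers = new List<PooledBuffer>();

    public int Count { get { return buffers.Count; } }

    public PooledBuffer LockBuffer(int elementCount, int stride)
    {
        // Reuse a free buffer only if it matches the requested specification exactly.
        foreach (var b in buffers)
        {
            if (!b.IsLocked && b.ElementCount == elementCount && b.Stride == stride)
            {
                b.Lock();
                return b;
            }
        }

        // Nothing suitable: create a new one, track it, and hand it out locked.
        var created = new PooledBuffer(elementCount, stride);
        created.Lock();
        buffers.Add(created);
        return created;
    }
}
```

Note the exact match on element count and stride: requesting 2048 elements while a free 1024 element buffer sits in the pool still creates a new buffer.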
This scheme is quite popular (I've seen it in Unity and many other code bases) and has the advantage of being simple, but it also has some issues.
First, you need to keep a pool per resource type, eg: one for textures (of each type), one for buffers.
Second, and this is the biggest disadvantage (especially for Render Targets), we need an exact size and format.
We can certainly relax the format constraint using Typeless resources (but you still need to create views in that case), but for size that's a no go (or at least impractical), since we would often need to implement our own sampler (which is not a big deal, except for Anisotropic, but that would badly pollute our shader code base). We would also need a bit of viewport/scissor gymnastics. Again, not that hard but really not convenient.
So if you render 3 scenes, each with different resolutions, your pool starts to collect a lot of resources of different sizes, your resource lists become bigger....
Of course you can clear unused resources from time to time (e.g. Dispose anything that has not been used), or count the number of frames since each resource was last used and apply a threshold (yay, let's write a garbage collector for GPU resources, hmpf....).
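For reference, the "frames since last used" threshold can be sketched in a few lines. This is only an illustration under assumptions: `AgedResource` and `PoolTrimmer` are hypothetical names, and the `Disposed` flag stands in for calling Dispose() on a real GPU resource.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool entry; a real one would wrap a GPU resource.
public sealed class AgedResource
{
    public string Name;          // label for the illustration only
    public int FramesSinceUsed;  // the pool resets this to 0 whenever the resource is handed out
    public bool Disposed;
}

public static class PoolTrimmer
{
    // Call once per frame: age every entry, then dispose and drop those idle for too long.
    // Returns the number of entries removed.
    public static int Trim(List<AgedResource> pool, int maxIdleFrames)
    {
        int removed = 0;

        // Iterate backwards so RemoveAt does not shift entries we still have to visit.
        for (int i = pool.Count - 1; i >= 0; i--)
        {
            pool[i].FramesSinceUsed++;
            if (pool[i].FramesSinceUsed > maxIdleFrames)
            {
                pool[i].Disposed = true; // a real pool would call Dispose() here
                pool.RemoveAt(i);
                removed++;
            }
        }
        return removed;
    }
}
```

The threshold is a trade-off: too low and you recreate resources all the time, too high and the pool keeps dead weight around.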
Nevertheless, I find that for "Frame lifetime" (and possibly a small subset of "Scene lifetime") resources, this model fits reasonably well, so I'll definitely keep it for a while (I guess DX12 Heaps will change that part, but let's keep DirectX12 for later posts ;)
So now we have clearly seen the problem. For my scoped resources, if I go back to the ao->dof->bokeh case, I have to create 6 targets + one buffer (one of them can be reused, but a lot of intermediates are in different formats depending on which post processing I'm currently applying).
Adding a second scene with a different resolution, that's of course 12 targets.
One main thing is, all this post processing is not applied at the same time (since your GPU serializes commands anyway). So all that memory could be happily shared. But we don't have fine-grained enough access to GPU memory for this (again, resources pointing to the same locations are now trivial in DX12, but here we are still in DX11). So it looks like a dead end.
In the meantime, in the Windows 8.1 world, some good Samaritan(s) introduced a few new features, one of them called Tiled Resources.
Basically, a resource created as Tiled has no initial memory; you have to map tiles (which are 64 KB chunks of memory) to it. Memory tiles are provided by a buffer created with a tile pool attribute.
Code Snippet
- BufferDescription bdPool = new BufferDescription()
- {
- BindFlags = BindFlags.None,
- CpuAccessFlags = CpuAccessFlags.None,
- OptionFlags = ResourceOptionFlags.TilePool,
- SizeInBytes = memSize,
- Usage = ResourceUsage.Default
- };
So you can create a huge resource with no memory, and assign tiles to some parts of it depending on your scene.
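Since memory is handed out in whole 64 KB tiles, it helps to know how many tiles a resource needs. The sketch below is an estimate under assumptions: `TileMath` is a hypothetical helper, and the texture tile shapes are the standard ones documented for non-multisampled 2D tiled textures. In real code I would rather query the resource itself (Direct3D 11.2 exposes GetResourceTiling for that) than hard-code them.

```csharp
using System;

public static class TileMath
{
    public const int TileSizeInBytes = 65536; // 64 KB per tile

    // Tiles needed to back a buffer of the given size, rounded up to whole tiles.
    public static int TilesForBuffer(long sizeInBytes)
    {
        return (int)((sizeInBytes + TileSizeInBytes - 1) / TileSizeInBytes);
    }

    // Tiles needed for a non-multisampled 2D texture with the given bits per pixel.
    // Assumed standard tile shapes (in texels):
    // 8bpp 256x256, 16bpp 256x128, 32bpp 128x128, 64bpp 128x64, 128bpp 64x64.
    public static int TilesForTexture2D(int width, int height, int bitsPerPixel)
    {
        int tileW, tileH;
        switch (bitsPerPixel)
        {
            case 8: tileW = 256; tileH = 256; break;
            case 16: tileW = 256; tileH = 128; break;
            case 32: tileW = 128; tileH = 128; break;
            case 64: tileW = 128; tileH = 64; break;
            case 128: tileW = 64; tileH = 64; break;
            default: throw new ArgumentException("Unsupported bits per pixel.", "bitsPerPixel");
        }

        // Partially covered tiles at the right and bottom edges still cost a full tile.
        int tilesX = (width + tileW - 1) / tileW;
        int tilesY = (height + tileH - 1) / tileH;
        return tilesX * tilesY;
    }
}
```

With these shapes, a 1920x1080 R16_Float target takes 8 x 9 = 72 tiles (4 718 592 bytes, against 4 147 200 bytes of raw pixel data), and an R32_Float target of the same size takes 15 x 9 = 135 tiles.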
This of course has a wide use for games (terrain/landscapes streaming, large shadow map rendering), and most examples follow that direction (check for sparse textures if you want documentation about those).
Then I noticed in one slide (forgot if it was from NVidia or Microsoft), "Tiled resources can eventually be used for more aggressive memory packing of short lived data".
There was no further explanation, but that sounds like my use case, so obviously, let's remove the word "eventually" and try it (meaning: get it working).
So in that case we think about it in reverse. Instead of having a large resource backed by a pool and which is partly updated/cleared, we provide a back end and allocate the same tiles to different resources (of course they need to belong to different units of work; the ones that need to be used at the same time must not overlap).
So let's do a first try, and create two Tiled buffers, which point to the same memory location.
Code Snippet
- SharpDX.Direct3D11.BufferDescription bd = new BufferDescription()
- {
- BindFlags = BindFlags.ShaderResource | BindFlags.UnorderedAccess,
- CpuAccessFlags = CpuAccessFlags.None,
- OptionFlags = ResourceOptionFlags.BufferAllowRawViews | ResourceOptionFlags.Tiled,
- SizeInBytes = memSize,
- Usage = ResourceUsage.Default,
- StructureByteStride = 4
- };
- SharpDX.Direct3D11.BufferDescription bd2 = new BufferDescription()
- {
- BindFlags = BindFlags.ShaderResource | BindFlags.UnorderedAccess,
- CpuAccessFlags = CpuAccessFlags.None,
- OptionFlags = ResourceOptionFlags.Tiled,
- SizeInBytes = memSize,
- Usage = ResourceUsage.Default,
- StructureByteStride = 4
- };
- var Buffer1 = DX11StructuredBuffer.CreateTiled<int>(device, elemCount);
- var Buffer2 = DX11StructuredBuffer.CreateTiled<int>(device, elemCount);
Here I show the buffer descriptions behind the CreateTiled static constructor, for readability.
And we need to provide a backend for it:
Code Snippet
- SharpDX.Direct3D11.BufferDescription bdPool = new BufferDescription()
- {
- BindFlags = BindFlags.None,
- CpuAccessFlags = CpuAccessFlags.None,
- OptionFlags = ResourceOptionFlags.TilePool,
- SizeInBytes = 65536,
- Usage = ResourceUsage.Default
- };
- var bufferPool = new SharpDX.Direct3D11.Buffer(device, bdPool);
Now we assign the same tile(s) to each buffer like this:
Code Snippet
- var rangeFlags = new TileRangeFlags[] { TileRangeFlags.None };
- context.Context.UpdateTileMappings(resource, 1, new TiledResourceCoordinate[] { }, new TileRegionSize[] { }, this.tilePoolBuffer, 1, rangeFlags, new int[] { 0 }, new int[] { }, TileMappingFlags.None);
We do this for each buffer.
Next, a simple test: create an Immutable resource with some random data, the same size as our tiled buffer (not the pool).
Use CopyResource to copy it into either Buffer1 or Buffer2 (not both).
Create two staging resources, s1 and s2.
Read back data from Buffer1 into s1 and from Buffer2 into s2.
Surprise: the data uploaded and copied from our immutable buffer is in both s1 and s2.
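The reason both read backs match is that a tiled resource is only a mapping table on top of the pool memory. Here is a CPU-side analogy of that test, not GPU code: `FakeTilePool` and `FakeTiledBuffer` are hypothetical classes, and `MapTiles` only loosely mirrors the role of UpdateTileMappings.

```csharp
using System;

// CPU-side stand-in for a tile pool: plain memory split into 64 KB tiles.
public sealed class FakeTilePool
{
    public const int TileSize = 65536;
    public readonly byte[] Memory;

    public FakeTilePool(int tileCount)
    {
        Memory = new byte[tileCount * TileSize];
    }
}

// CPU-side stand-in for a tiled buffer: it owns no memory, only a tile mapping table.
public sealed class FakeTiledBuffer
{
    private readonly int[] tileMap; // buffer tile index -> pool tile index, -1 = unmapped
    private FakeTilePool pool;

    public FakeTiledBuffer(int tileCount)
    {
        tileMap = new int[tileCount];
        for (int i = 0; i < tileCount; i++) tileMap[i] = -1;
    }

    // Point a range of this buffer's tiles at a range of pool tiles.
    public void MapTiles(FakeTilePool targetPool, int firstBufferTile, int firstPoolTile, int count)
    {
        pool = targetPool;
        for (int i = 0; i < count; i++) tileMap[firstBufferTile + i] = firstPoolTile + i;
    }

    public void Write(int offset, byte value) { pool.Memory[Translate(offset)] = value; }

    public byte Read(int offset) { return pool.Memory[Translate(offset)]; }

    // Translate a byte offset in the buffer into a byte offset in the pool memory.
    private int Translate(int offset)
    {
        int poolTile = tileMap[offset / FakeTilePool.TileSize];
        if (poolTile < 0) throw new InvalidOperationException("Tile is not mapped.");
        return poolTile * FakeTilePool.TileSize + offset % FakeTilePool.TileSize;
    }
}
```

Map tile 0 of two such buffers to pool tile 0, write through the first one, and the second one reads the same byte back, which is exactly what s1 and s2 showed on the GPU.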
So now that we have the proof of concept working, we create a small tile pool (with an initial memory allocation).
Now for each post processor, we register our resources (and update the tile offset accordingly to avoid overlap; we may also need to pad buffers/textures since our start location needs to be aligned to a full tile).
Before that, we need to check whether our pool is big enough, and resize it if required. The beauty of it is that this is not destructive (increasing the pool size will allocate new tiles but will preserve the mappings of existing ones), so this is done as:
Code Snippet
- context.Context.ResizeTilePool(this.tilePoolBuffer, newPageCount * PageSize);
And that's more or less it: our pool size is now equal to the size of the largest unit of work instead of the sum of all of them.
It simply means, in the case above, that our pool size is the sum of the resource sizes of the largest effector (which is HBAO in the example).
Adding another post processing chain somewhere else will reuse that same memory for all the short lived objects, so if I start to have 5 scenes rendering that's a quite significant gain.
As a side note registration looks (for now) this way:
Code Snippet
- this.Device.SharedTiledPool.BeginCollect();
- lindepth = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R32_Float);
- rthbaox = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16_Float);
- rthbaoy = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16G16_Float);
- rtblurx = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16G16_Float);
- rtblury = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16_Float);
- this.Device.SharedTiledPool.EndCollect();
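The bookkeeping behind such a collect pass can be sketched as below. This is an assumption about how it could work, not the actual SharedTiledPool code: `TilePlacementPlanner` is a hypothetical class that only tracks tile offsets and does not touch any GPU resource.

```csharp
using System;

public sealed class TilePlacementPlanner
{
    public const int TileSizeInBytes = 65536;

    private int cursor; // next free tile within the current unit of work

    // High-water mark across all units of work collected so far.
    public int RequiredTiles { get; private set; }

    public long RequiredBytes { get { return (long)RequiredTiles * TileSizeInBytes; } }

    // Every unit of work starts again at tile 0, which is what makes the memory shared.
    public void BeginCollect() { cursor = 0; }

    // Reserve tileCount tiles and return the first tile index for this resource.
    // Resources placed within the same unit never overlap.
    public int Place(int tileCount)
    {
        int start = cursor;
        cursor += tileCount;
        return start;
    }

    // The pool only has to be as large as the biggest unit seen so far.
    public void EndCollect()
    {
        RequiredTiles = Math.Max(RequiredTiles, cursor);
    }
}
```

For example, one unit placing 135 + 72 tiles and another placing 72 tiles gives RequiredTiles = 207 (the maximum) rather than 279 (the sum). The value of RequiredBytes is then what a grow-only resize of the pool would be checked against.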
Now of course I ran a performance check, pool vs pool, which ended in a draw (so I keep the same performance, no penalty, always a good thing), and here is a small memory profile.
I render one scene in full HD + HDR, and a second scene in half HD + HDR.
Case 1 (old pool only):
- Global pool : 241 459 200
- Tiled Pool: 0
Case 2 (starting to use shared pool):
- Global pool: 109 670 400
- Tiled pool : 58 064 896
- Total memory: 167 735 296
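The gain between the two cases is simple arithmetic on the figures above (in bytes). `MemoryGain` is just a hypothetical helper to spell it out.

```csharp
using System;

public static class MemoryGain
{
    // Bytes saved when going from one total to another.
    public static long SavedBytes(long before, long after)
    {
        return before - after;
    }

    public static double ToMiB(long bytes)
    {
        return bytes / (1024.0 * 1024.0);
    }
}
```

With case 1 at 241 459 200 and case 2 at 109 670 400 + 58 064 896 = 167 735 296, SavedBytes returns 73 723 904, which is roughly 70 MiB, or about 30% of the original footprint.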
So ok, some people will say that a 70 meg gain is meaningless in a day when you have 12 gigs of RAM in a Titan card, but well :
- Most people don't have a Titan card. (I normally plan on 4gb cards when doing projects).
- Adding a new scene will not change the tiled pool size, and increase the global pool in a much smaller fashion.
- If you start to add Msaa or render 3x full HD, you can expect a larger gain
- When you start to have a few assets in the mix (like a 2+ gig car model, that never happened to me, did it? ;) a hundred megs can make a huge difference.
- For cards that don't support tiled resources, the technique is really easily swappable, so it's not a big deal to fall back to the global pool if the feature is not supported (or to let the user decide).
- I applied it quickly as a proof of concept and only on 3 effectors. Now that this works, I can also more aggressively optimize the post processing chain in the same way (and actually also anything that needs temporary resources in my general rendering, and there's a lot).
That's it for now. As a side note, I have some other pretty cool use cases for this; they will likely end up here once I have implemented them.