For a while I have really wanted to know how much my user interface rendering actually costs me.
I consider my UI pretty smooth, but smoother is better than smooth ;)
To debug Direct2D you can use the Visual Studio Graphics Debugger, but since I have several windows/panels, it's quite hard to get a capture.
I also noticed that a capture sometimes doesn't include all the elements.
Technically, I know that Direct2D uses a DirectX11 device (with feature level 10), so if I could get that device pointer, I could easily run queries between the BeginDraw/EndDraw calls.
But it looks like that's not possible...
However...
It is now possible to provide our own DirectX11 device to a Direct2D context. I looked at that feature earlier on and thought it would be really complicated to update, but this was so simple that I feel embarrassed that I didn't do it before ;)
The reasoning is simple: instead of creating a HwndRenderTarget, we create a SwapChain using our own DirectX11 device.
Then we can create a Direct2D device context from it:
var context2d = new DeviceContext(this.swapChain.Texture.QueryInterface<SharpDX.DXGI.Surface>());
//Call release on texture since queryinterface does an addref
Marshal.Release(this.swapChain.Texture.NativePointer);
Yes, it is that easy.
The context is created from the swap chain's texture, so it uses the device that owns it instead of creating a new one.
Also, since the device context is a Direct2D RenderTarget, the rest of the UI rendering code needed no changes. Life is good at times ;)
There's only one difference now: calling EndDraw does not trigger a Present call on the swap chain (technically I could use a standard render target instead of a swap chain), so you have to call Present on your swap chain manually (not a big deal really ;)
Now that I have the device context working that way, I can just create a pipeline statistics query and a timestamp query.
I'm pretty interested in primitive count and render time, obviously.
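For the render time, a D3D11 timestamp query gives raw GPU ticks, and a timestamp-disjoint query gives the tick frequency plus a flag telling whether the interval can be trusted. Here is a minimal sketch of the conversion in plain C++. The `DisjointData` struct is a hypothetical stand-in for `D3D11_QUERY_DATA_TIMESTAMP_DISJOINT` (so it compiles without the Windows SDK), and `gpuMilliseconds` is a name made up for illustration:

```cpp
#include <cstdint>

// Hypothetical stand-in for D3D11_QUERY_DATA_TIMESTAMP_DISJOINT.
struct DisjointData {
    std::uint64_t frequency; // GPU ticks per second
    bool disjoint;           // true when the timestamps cannot be trusted
};

// Converts two timestamp results (one issued after BeginDraw, one after
// EndDraw) into milliseconds. Returns false when the interval is unusable,
// in which case outMs is left untouched.
bool gpuMilliseconds(std::uint64_t beginTicks, std::uint64_t endTicks,
                     const DisjointData& info, double& outMs)
{
    if (info.disjoint || info.frequency == 0 || endTicks < beginTicks) {
        return false;
    }
    const double ticks = static_cast<double>(endTicks - beginTicks);
    outMs = ticks * 1000.0 / static_cast<double>(info.frequency);
    return true;
}
```

A frame whose disjoint flag is set is best skipped rather than averaged in.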
1/Primitive cost
First, let's have a look at each geometry I use and check its cost at the IA (Input Assembler) stage.
Since I draw a border, I check the cost of both "Fill[Primitive]" and "Draw[Primitive]", which builds the outline.
The first one I use a lot is RoundedRectangle (with a 1 pixel corner).
FillRoundedRectangle (1 pix corner) : 96
DrawRoundedRectangle : 264
So a single rounded rectangle costs 360 primitives per element (96 + 264). That hurts (a lot).
Let's replace those with a standard Rectangle (as a side note, since I have grid snapping, the visual difference was what I would call "None").
FillRectangle : 6
This was pretty expected, 2 triangles
DrawRectangle : 190
This was much less expected, and feels rather high.
Still, replacing RoundedRectangle with Rectangle pretty much halves the poly count (196 instead of 360 per element), and of course gave my rendering a pretty big boost.
Now let's look at Ellipse (which I use for pins, and for keyframes in the timeliner):
FillEllipse : 204
DrawEllipse : 204
This is quite a staggering cost. I often have more than 1000 keyframes in my timeliner nowadays (OK, I already cull keyframe rendering), so my worst case scenario (fully zoomed-out UI, with everything in the panel visible) is a whopping 408000 primitives (1000 keyframes x 408)!
In the case of a patch, I can switch pins from Ellipse to Rectangle as well (I have come to prefer ellipses, which are visually much more pleasing, but I can also add a render mode parameter to choose low/high quality).
In the case of the timeline, I can't really use quads, so I need a solution (see below).
Next we have Lines/Beziers
Line (regardless of orientation/size) : 46
Bezier : from 312 to 620
Dashed Bezier : same as above
So links are also rather expensive; a switch to choose the link style is definitely a nice thing to add.
And obviously, finally :
Text : 6
I'll come back to this later, but text is pretty much 6 primitives (and likely a sprite sheet texture bound as well).
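To make the arithmetic explicit, the per-call counts measured above can be put in a small table and summed per bordered element. This is just a sketch in C++: the struct and function names are made up for illustration, and the numbers are the ones listed above.

```cpp
// Primitive counts per Direct2D call, as measured above.
struct ShapeCost {
    const char* name;
    unsigned fill; // Fill[Primitive]
    unsigned draw; // Draw[Primitive] (the outline)
};

static const ShapeCost kRoundedRectangle = {"RoundedRectangle", 96, 264};
static const ShapeCost kRectangle        = {"Rectangle", 6, 190};
static const ShapeCost kEllipse          = {"Ellipse", 204, 204};

// Cost of one bordered element: a fill plus its outline.
unsigned elementCost(const ShapeCost& shape)
{
    return shape.fill + shape.draw;
}

// Total cost for a number of identical bordered elements.
unsigned long totalCost(const ShapeCost& shape, unsigned long count)
{
    return static_cast<unsigned long>(elementCost(shape)) * count;
}
```

With these, `elementCost(kRoundedRectangle)` gives 360 against 196 for `elementCost(kRectangle)`, and `totalCost(kEllipse, 1000)` gives the 408000 worst case for the timeline.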
2/Draw ordering and Buffers
This is something I actually looked at more using the graphics debugger, and it also gives you very valuable information.
For any solid color brush, Direct2D accumulates the geometry into a buffer; when it hits a context change (see below) or the buffer is full, it copies the buffer and issues a draw.
So it does a pretty good job of limiting draw calls, but makes little use of GPU instancing, except for text rendering.
Let's take a standard node draw routine and see what's bad in there (pseudo code):
for each node
    fill rectangle
    draw outline
    draw title (text)
    draw pins (another loop)
end for
This translates as follows in Direct2D (let's say we have 2 nodes, and I'll leave out pins for clarity):
context->Draw(pcount, offset); //This is first rectangle + outline
context->DrawInstanced(6,1,0,0,0); //Text
context->Draw(pcount, offset); //This is second rectangle + outline
context->DrawInstanced(6,1,0,0,0); //Text
So each node needs 2 draw calls.
The obvious issue is that text rendering requires a different set of shaders, so Direct2D has to switch state and can't batch efficiently anymore.
So let's reorganize our drawing this way:
for each node
    fill rectangle
    draw outline
end for
for each node
    draw title
end for
So now we have 2 loops instead: first we render all the rectangles, then we render all the text.
And then magically :
context->DrawInstanced(6,n,0,0,0); //Node count
Now all our text is batched into a single draw, and the rectangle draws are reduced as well (depending on node count, but pretty often down to a couple of calls at most).
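The effect of draw order can be reproduced with a toy model of the batching described above. This is only a sketch of the idea, not Direct2D's actual implementation: it assumes one batch per run of consecutive commands that share the same pipeline (geometry or text), it ignores the buffer-full case, and all names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

enum class Pipeline { Geometry, Text };

// Counts draw calls, assuming consecutive commands on the same pipeline
// are merged and every pipeline switch forces a flush.
std::size_t countDrawCalls(const std::vector<Pipeline>& commands)
{
    std::size_t draws = 0;
    for (std::size_t i = 0; i < commands.size(); ++i) {
        if (i == 0 || commands[i] != commands[i - 1]) {
            ++draws; // a new batch starts here
        }
    }
    return draws;
}

// Original order: rectangle, outline, title for each node in turn.
std::vector<Pipeline> interleaved(std::size_t nodeCount)
{
    std::vector<Pipeline> commands;
    for (std::size_t i = 0; i < nodeCount; ++i) {
        commands.push_back(Pipeline::Geometry); // fill rectangle
        commands.push_back(Pipeline::Geometry); // draw outline
        commands.push_back(Pipeline::Text);     // draw title
    }
    return commands;
}

// Reorganized order: all rectangles and outlines first, then all titles.
std::vector<Pipeline> grouped(std::size_t nodeCount)
{
    std::vector<Pipeline> commands;
    for (std::size_t i = 0; i < nodeCount; ++i) {
        commands.push_back(Pipeline::Geometry); // fill rectangle
        commands.push_back(Pipeline::Geometry); // draw outline
    }
    for (std::size_t i = 0; i < nodeCount; ++i) {
        commands.push_back(Pipeline::Text);     // draw title
    }
    return commands;
}
```

In this model the interleaved order costs 2 draws per node, while the grouped order stays at 2 draws whatever the node count (a real buffer has a capacity, so the geometry side can still split into a few calls).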
3/Hybrid rendering
By replacing some elements, I already managed to get quite a significant gain.
Here is the cost of rounded rectangles versus rectangles on a reasonably large patch (please note that CPU and GPU times are not additive, since they work in tandem):
RoundRect + Ellipse
CPU : 6.5ms
GPU : 4ms
Rectangles
CPU : 3ms
GPU : 0.5ms
This is a pretty huge boost (especially considering we have the same quality).
Now, I mentioned that swapping Ellipse for Rectangle was not an option for the timeline, so I need a solution.
The obvious first choice is to use a small circle texture, but that does not play well with antialiasing (quite a big loss of quality).
So I need another solution....
And...
Do you remember? I am now drawing on a DirectX11 swap chain, and I have access to it.
So let's go hybrid.
Here is a rendering profile with 1000+ keyframes (full redraw every frame, all keyframes visible):
CPU : 14ms
GPU : 9ms
Primitives : 388k
We can clearly see that this hurts our graphics card quite a lot: since Direct2D doesn't use instancing on the GPU side, we have 388000 primitives uploaded to the GPU.
The rendering process is as follows (pseudo code again):
for each track
    render header
    for each keyframe
        if keyframe in timeline time range
            calculate position
            draw keyframe
        end if
    end for
end for
render other bits (rulers...)
Now let's give DirectX11 a bit of work: we create a simple instancing shader (it rebuilds the size and gets the color from a small buffer).
We create 2 structured buffers (1 for screen space position, 1 for color index)
and change the rendering as follows:
reset keyframe counter
for each track
    render header
    for each keyframe
        if keyframe in timeline time range
            add position/colorid into the cpu array
            increment keyframe counter
        end if
    end for
end for
if keyframe counter > 0
    end draw (hand over from d2d to d3d)
    copy position/colorid to buffers
    draw instanced circles (outline) -> only need to upload position buffer + single draw => huge win
    draw instanced circles (background) -> reuse the same buffer but scale down the circle in VS
    begin draw (hand drawing rights back to direct2d)
end if
render other bits
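The CPU side of this loop could look roughly like the following. It is a sketch under assumptions: the `Keyframe`, `Track` and `Position` types, the linear time-to-pixel mapping and the fixed row height are all hypothetical, and the two output arrays simply stand in for the data copied into the two structured buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Keyframe { double time; std::uint32_t colorId; };
struct Track { std::vector<Keyframe> keyframes; };
struct Position { float x; float y; };

struct InstanceData {
    std::vector<Position> positions;     // screen space, one per keyframe
    std::vector<std::uint32_t> colorIds; // index into a small color buffer
};

// Gathers the visible keyframes of all tracks into flat CPU arrays.
// Assumes timeEnd > timeStart and a linear mapping from time to pixels.
InstanceData collectVisibleKeyframes(const std::vector<Track>& tracks,
                                     double timeStart, double timeEnd,
                                     float viewWidth, float rowHeight)
{
    InstanceData out;
    for (std::size_t row = 0; row < tracks.size(); ++row) {
        for (const Keyframe& key : tracks[row].keyframes) {
            if (key.time < timeStart || key.time > timeEnd) {
                continue; // outside the timeline time range: culled
            }
            const double t = (key.time - timeStart) / (timeEnd - timeStart);
            Position p;
            p.x = static_cast<float>(t) * viewWidth;
            p.y = (static_cast<float>(row) + 0.5f) * rowHeight;
            out.positions.push_back(p);
            out.colorIds.push_back(key.colorId);
        }
    }
    return out;
}
```

The size of `positions` plays the role of the keyframe counter: it is the instance count for both instanced draws (outline and background).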
Now, using instanced circles, here is where we are:
CPU : 3ms
GPU : 2.4ms
Primitives : 51k
That's nearly 5 times faster on the CPU (14ms down to 3ms) and nearly 4 times faster on the GPU (9ms down to 2.4ms), not bad ;)
We also divided our primitive count by more than 7 (388k down to 51k), without any loss of quality!
Let's try other techniques (note: we lose antialiasing in these cases):
Instanced rectangles -> clip/discard in pixel shader
CPU : 3ms
GPU : 2.0ms
Primitives : 13k
No initial geometry (build both rectangles in GS, clip in PS)
CPU : 2.9ms
GPU : 1.7ms
Primitives : 11k
If you can accept losing AA, that can be another reasonable gain (thinking of lower end machines).
So being able to use DirectX11 alongside Direct2D is a pretty massive win :)
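For reference, the per-pixel test behind the clip/discard variants is tiny. The sketch below runs the same logic on the CPU in C++; in the real thing it would live in the pixel shader, and the function names and the [-1, 1] quad coordinates are assumptions. Because the result is a hard yes/no per pixel, the edge gets no antialiasing, which matches the quality loss noted above.

```cpp
// Returns true when a pixel of the instanced quad should be kept.
// u and v are the quad's local coordinates, assumed to span [-1, 1].
bool insideCircle(float u, float v)
{
    return (u * u + v * v) <= 1.0f;
}

// Counts kept pixels for a size x size quad, sampling pixel centers.
unsigned coveredPixels(unsigned size)
{
    unsigned kept = 0;
    for (unsigned y = 0; y < size; ++y) {
        for (unsigned x = 0; x < size; ++x) {
            const float u = (static_cast<float>(x) + 0.5f) / size * 2.0f - 1.0f;
            const float v = (static_cast<float>(y) + 0.5f) / size * 2.0f - 1.0f;
            if (insideCircle(u, v)) {
                ++kept; // everything else would be clipped/discarded
            }
        }
    }
    return kept;
}
```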
4/Next stage
Obviously, having realized how much we can gain from hybrid rendering, it would be a shame to stop here. Users love a smooth UI, so let's strive to give them one :)
Having access to the device also makes some other features much easier (like drawing a texture inside the D2D viewport).
As a side note: yes, the idea is to render the user interface every frame (no partial redraw for now).
Partial redraw may feel more efficient, but only once you have nailed the full draw (since a zoom/pan means a full redraw, I don't want a half-second drop when I do that ;)
New results/post soon