Friday 20 February 2015

Profiling Direct2d (and hybrid ui rendering)

For a while I have really wanted to know how much my user interface rendering actually costs me.
I consider my UI pretty smooth already, but smoother is better than smooth ;)

To debug Direct2d you can use the Visual Studio Graphics Debugger, but since I have several windows/panels, it's quite hard to capture a frame.

Also I noticed that sometimes a snapshot doesn't include all the elements.

Technically I know that Direct2d uses a DirectX11 (with feature level 10) device, so if I can get this device pointer, I can easily run queries between those BeginDraw/EndDraw calls.

But it looks like it's not possible....

However...

It is now possible to provide our own DirectX11 device to a Direct2d context. I looked at that feature earlier on and thought it would be really complicated to update, but this was so simple that I feel embarrassed that I didn't do it before ;)

The reasoning is simple: instead of creating a HwndRenderTarget, we create a SwapChain using our own DirectX11 device.

Then we can create a Direct2d Render Context from this:

    var context2d = new DeviceContext(this.swapChain.Texture.QueryInterface<SharpDX.DXGI.Surface>());
    // Call Release on the texture, since QueryInterface does an AddRef
    Marshal.Release(this.swapChain.Texture.NativePointer);


Yes, this is that easy.

The context is created from the SwapChain, so it uses the device that owns the swap chain instead of creating a new one.

Also since our context implements Direct2d RenderTarget, there was no code update for the rest of the ui rendering, life is good at times ;)

There's only one difference now: calling EndDraw does not trigger a Present call on the swapchain (technically I could use a standard render target instead of a SwapChain), so you have to call Present on your swapchain manually (not a big deal really ;)


Now that I have the device context working that way, I can just create a pipeline statistics query and a timestamp query.

I'm mostly interested in primitive count and render time, obviously.
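As a rough sketch of the arithmetic behind the timestamp part: a Direct3D 11 timestamp query returns a raw GPU tick count, and the matching timestamp-disjoint query returns the tick frequency plus a flag telling whether the interval is reliable. The helper below is a CPU-side illustration in Python rather than the actual tool code, and the function name, parameter names and numbers are made up for the example.

```python
def gpu_time_ms(begin_ticks, end_ticks, frequency_hz, disjoint=False):
    """Convert two raw GPU timestamps (in ticks) into milliseconds.

    Returns None when the disjoint flag is set or the frequency is
    invalid, since the measurement cannot be trusted in that case.
    """
    if disjoint or frequency_hz <= 0:
        return None
    return (end_ticks - begin_ticks) * 1000.0 / frequency_hz


# Hypothetical numbers: 25,000 ticks elapsed on a 10 MHz counter.
print(gpu_time_ms(1_000_000, 1_025_000, 10_000_000))  # 2.5
```

One timestamp goes right after BeginDraw and one right after EndDraw, with the disjoint query wrapped around both.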

1/Primitive cost

So first let's have a look at each geometry I use and check its cost at the IA stage.
Since I draw a border, I check the cost of the "Fill[Primitive]" call and of the "Draw[Primitive]" call, which builds the outline.

The first one I use a lot is RoundedRectangle (with a 1 pixel corner).

FillRoundedRectangle (1 pix corner) : 96
DrawRoundedRectangle : 264

So a single rounded rectangle costs more than 350 primitives per element (96 + 264 = 360), and that hurts (a lot).

Let's replace those with standard Rectangles (as a side note, since I have the grid snap, the visual difference was actually what I could call: "None")

FillRectangle : 6

This was pretty expected, 2 triangles

DrawRectangle : 190

This was much less expected, and feels rather high.

Still, replacing RoundedRectangle with Rectangle pretty much halves your poly count (360 down to 196 per element), and of course gave a pretty high boost to my rendering.

Now let's go for Ellipse (Which I use for pins/keyframes in timeliner):

FillEllipse : 204
DrawEllipse : 204

This is quite a staggering cost. I often have more than 1000 keyframes in my timeliner nowadays (OK, I already cull keyframe rendering), so my worst case scenario (fully zoomed-out UI with every keyframe visible) is a whopping 408000 primitives (1000 x (204 + 204))!

In the case of a patch, I can switch pins from Ellipse to Rectangle as well (I have learned to prefer ellipses, which are visually much more pleasing, but I can also add a render mode parameter to choose low/high quality).

In the case of the timeline, I can't really use quads, so I need a solution (see below).

Next we have Lines/Beziers

Line (whatever the orientation/size...) : 46
Bezier : from 312 -> 620
Dashed Bezier : same as above

So links are also rather expensive; a switch to choose the link style is definitely a nice thing to add.

And obviously, finally : 

Text : 6 

I'll go back into this later, but pretty much text is 6 primitives (and likely a sprite sheet texture bound as well).
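To make the budgeting explicit, the counts measured above can be put in a small table and summed per frame. This is only a back-of-the-envelope sketch in Python: the numbers are the ones from this post, while `PRIMITIVE_COST` and `frame_cost` are hypothetical names, and Bezier is left out because its cost is a range rather than a single number.

```python
# Counts per Direct2D call, as measured in this post.
PRIMITIVE_COST = {
    "FillRoundedRectangle": 96,
    "DrawRoundedRectangle": 264,
    "FillRectangle": 6,
    "DrawRectangle": 190,
    "FillEllipse": 204,
    "DrawEllipse": 204,
    "Line": 46,
    "Text": 6,
}


def frame_cost(calls):
    """Sum the cost of a frame given a {call name: number of calls} dict."""
    return sum(PRIMITIVE_COST[name] * count for name, count in calls.items())


rounded_node = frame_cost({"FillRoundedRectangle": 1, "DrawRoundedRectangle": 1})
plain_node = frame_cost({"FillRectangle": 1, "DrawRectangle": 1})
keyframes = frame_cost({"FillEllipse": 1000, "DrawEllipse": 1000})
print(rounded_node, plain_node, keyframes)  # 360 196 408000
```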

2/Draw ordering and Buffers

This is one thing which I actually looked more while using the graphics debugger, but this also gives you very valuable information.

For any solid color brush, Direct2d fills the geometry content into a buffer, and either when it gets a context change (see below) or when the buffer is full, it copies the buffer and issues a draw.

So it does a pretty cool job at limiting draw calls, but does not use so much of GPU instancing, except for Text rendering.

So let's take a standard node draw routine, and we'll see what's bad in there (pseudo code)

for each node
    fill rectangle
    draw outline
    draw title (text)
    draw pins (another loop)
end for

This translates this way in Direct2d (let's say we have 2 nodes, and I'll remove pins for clarity)

context->Draw(pcount, offset); //Node 1: rectangle + outline
context->DrawInstanced(6,1,0,0,0); //Node 1: text
context->Draw(pcount, offset); //Node 2: rectangle + outline
context->DrawInstanced(6,1,0,0,0); //Node 2: text

So each node needs 2 draw calls.
The obvious issue is that text rendering requires a different set of shaders, so Direct2d has to swap and can't batch efficiently anymore.

So let's reorganize our drawing this way:

for each node
    fill rectangle
    draw outline
end for

for each node
    draw title
end for

So now we build 2 loops instead: first we render all the rectangles, then we render all the text.

And then magically :
context->DrawInstanced(6,n,0,0,0); //Node count

So now all our text is batched in a single draw, and the number of rectangle draws is also reduced (it depends on node count, but pretty often it comes down to a couple of calls at most).
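The effect of the reordering can be modelled on the CPU. The sketch below is a simplification in Python, not what Direct2d does internally: it assumes consecutive calls that share the same shader state always merge into one draw, and it ignores flushes caused by a full buffer. The state names "solid" and "text" are labels invented for the example.

```python
from itertools import groupby


def count_draw_calls(commands):
    """Count draws when consecutive commands with the same state are merged.

    `commands` is a list of state names, one per Direct2d call, in
    submission order. Buffer-full flushes are ignored in this model.
    """
    return sum(1 for _state, _run in groupby(commands))


def interleaved(node_count):
    # Per node: fill + outline share the solid color path, text does not.
    return ["solid", "solid", "text"] * node_count


def grouped(node_count):
    # All rectangles first, then all titles.
    return ["solid", "solid"] * node_count + ["text"] * node_count


print(count_draw_calls(interleaved(100)))  # 200
print(count_draw_calls(grouped(100)))      # 2
```

In this model the interleaved loop costs two draws per node, while the grouped loops cost two draws in total.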


3/Hybrid rendering

So now, by replacing some elements, I have already managed to get quite a significant gain.

Here is roundrect to rect cost on a reasonably large patch (please note that cpu/gpu times are not additive, since they work in tandem)

RoundRect + Ellipse
CPU : 6.5ms
GPU : 4ms

Rectangles
CPU : 3ms
GPU : 0.5ms

This is a pretty huge boost (especially considering we have the same quality).

Now I mentioned that swapping Ellipse for Rectangle was not an option for timeline, so I need a solution.

The obvious first choice is to use a small circle texture, but that does not fit really well with antialias (quite a big loss of quality).

So I need another solution....

And...

Do you remember? I am now drawing on a DirectX11 Swapchain, and I got access to it.

So let's go hybrid.

Here is the rendering profile for 1000+ keyframes (full redraw every frame, all keyframes visible)
CPU : 14ms
GPU : 9ms
Primitives : 388k

We can clearly see that this hurts our graphics card quite a lot: since Direct2d doesn't instance on the GPU side, we have 388000 primitives uploaded to our GPU.

The rendering process is as follows (pseudo code again)

for each track
    render header 
    for each keyframe
        if keyframe in timeline time range
           calculate position
           draw keyframe
        end if
    end for
end for
render other bits (rulers...)

So now let's give DirectX11 a bit of work: we create a simple instancing shader (it rebuilds the size and gets the color from a small buffer).

We create 2 structured buffers (1 for screen space position, 1 for color index)

and change the rendering as follows

reset keyframe counter
for each track
    render header
    for each keyframe
        if keyframe in timeline time range
            add position/colorid into the cpu array
            increment keyframe counter
        end if
    end for
end for
if keyframe counter > 0
    end draw (hand over from d2d to d3d)
    copy position/colorid to buffers
    draw instanced circles (outline) -> only need to upload position buffer + single draw => huge win
    draw instanced circles (background) -> reuse the same buffer but scale down the circle in VS
    begin draw (we give back direct2d drawing rights)
end if
render other bits
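The CPU half of that loop (cull, compute position, append to the arrays) can be sketched as follows. This is a Python illustration only: the `Keyframe` record, the linear time-to-pixel mapping and the row layout are assumptions made for the example, and the actual upload and instanced draw are left out.

```python
from collections import namedtuple

# Hypothetical keyframe record: a time in seconds plus a palette index.
Keyframe = namedtuple("Keyframe", "time color_id")


def collect_instances(tracks, t_start, t_end, width_px, row_height_px):
    """Gather screen positions and color ids for the visible keyframes.

    `tracks` is a list of keyframe lists, one list per track row, and
    t_end must be greater than t_start. Returns two parallel lists that
    would then be copied into the position and color id buffers.
    """
    positions, color_ids = [], []
    scale = width_px / (t_end - t_start)
    for row, track in enumerate(tracks):
        y = (row + 0.5) * row_height_px        # vertical center of the row
        for key in track:
            if t_start <= key.time <= t_end:   # cull outside the time range
                positions.append(((key.time - t_start) * scale, y))
                color_ids.append(key.color_id)
    return positions, color_ids


tracks = [[Keyframe(0.5, 0), Keyframe(5.0, 1)], [Keyframe(1.0, 2)]]
positions, color_ids = collect_instances(tracks, 0.0, 2.0, 800, 20)
print(positions)   # [(200.0, 10.0), (400.0, 30.0)]
print(color_ids)   # [0, 2]
```

The length of `positions` plays the role of the keyframe counter, so a separate counter is not needed in this form.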

Now, using instanced circles, here is where we are
CPU : 3ms
GPU : 2.4ms
Primitives : 51k

That's almost 5 times faster on CPU workload (14ms -> 3ms), and almost 4 times faster on GPU (9ms -> 2.4ms), not bad ;)
Also we divided our primitive count by more than 7 (388k -> 51k), without a loss of quality!

Let's try other techniques (note: with these we lose antialiasing)
Instanced Rectangles -> clip/discard in pixel shader
CPU : 3ms
GPU : 2.0ms
Primitives : 13k

No initial geometry (build both Rectangles in GS, clip in PS)
CPU : 2.9ms
GPU : 1.7ms
Primitives : 11k

If you accept losing antialiasing, that can be another reasonable gain (thinking of lower end machines)
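For the clip/discard variants, the per-pixel test boils down to "is this pixel inside the unit circle of the quad". The sketch below models that test on the CPU in Python (the real thing lives in a pixel shader); the mapping of pixel centers to [-1, 1] and the function name are assumptions for the example. Since each pixel is either fully kept or fully dropped, the edge gets no partial coverage, which is where the antialiasing goes.

```python
def circle_mask(size):
    """Return a size x size grid of booleans, True where the pixel is kept.

    Each pixel center is mapped to [-1, 1] quad coordinates; pixels that
    fall outside the unit circle are the ones the shader would discard.
    """
    mask = []
    for py in range(size):
        row = []
        for px in range(size):
            x = (px + 0.5) / size * 2.0 - 1.0
            y = (py + 0.5) / size * 2.0 - 1.0
            row.append(x * x + y * y <= 1.0)
        mask.append(row)
    return mask


for row in circle_mask(8):
    print("".join("#" if keep else "." for keep in row))
```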

So being able to use DirectX11 alongside Direct2d is a pretty massive win :)

4/Next stage

Obviously realizing how much gain we can get out of hybrid rendering, it would be a shame to stop here, users love smooth UI, so let's strive to give them this :)

Also, having access to the device makes some other features much easier (like drawing a texture inside the d2d viewport).

Also, as a side note: yes, the idea is to render the user interface every frame (no partial redraw for now).
Partial redraw may feel more efficient, but only once you have nailed the full draw (since a zoom/pan = full redraw, I don't want a half second drop when I do this action ;)

New results/post soon

Saturday 14 February 2015

Timeliner (again)

I haven't posted on my blog for quite a while. I've been rather busy on projects, and January was rather packed.

I used vvvv for one and FlareTic for the other.

At the end of each project, I could see how important a proper Timeliner is and how this changes your workflow (really I mean it, and coming from me who used to hate timelines, it must mean something...)

I already added the remaining basic features in December (copy paste, undo/redo stack, organize tracks/groups, zoom + pan, navigation, snap to rulers...), which is the MINIMUM expected.

Once the basics are done, it's time to start on fancy features (always do basic then fancy, not the other way round...)

First is the keyframe group: you select a bunch of keyframes and can move them as a single unit (while still being able to move each keyframe on its own).


You can see keyframes with a small alpha; ctrl + click allows you to move the whole group instead of the keyframe itself.

This is really handy when you need to quickly reorganize your keyframes after some requirement change.

Next you can notice purple/blue keyframes, what are those?

This is a simple thing called aliases: those keyframes reference a parent one.

This is really useful when you copy a bulk of keyframes but need (almost) the same value. An alias holds an offset to the parent value, so changing the parent also changes the alias, while you can keep a bit of variation using offsets.
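A minimal sketch of the alias idea in Python, assuming plain float values; the class and field names are hypothetical and the editor's real data model is not shown in this post.

```python
from dataclasses import dataclass


@dataclass
class Keyframe:
    time: float
    value: float


@dataclass
class AliasKeyframe:
    time: float
    parent: Keyframe
    offset: float = 0.0

    @property
    def value(self):
        # Resolved on every read, so edits to the parent show up at once.
        return self.parent.value + self.offset


parent = Keyframe(time=1.0, value=0.5)
alias = AliasKeyframe(time=4.0, parent=parent, offset=0.25)
print(alias.value)   # 0.75
parent.value = 1.0
print(alias.value)   # 1.25
```

Storing only the offset, and never a copy of the value, is what keeps the alias in sync with its parent.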


So all of this really changed my way of working, but now, let's think forward...non linear playback.

First I already have a playback optimizer (using tree based interval data structure).
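The optimizer itself is not shown here. As an illustration of the kind of structure meant, below is a small centered interval tree in Python that answers "which clips are active at time t" without scanning every clip. The class name, the (start, end, payload) tuple layout and the sample clips are all assumptions for the example.

```python
class ClipIntervalTree:
    """Centered interval tree over (start, end, payload) tuples, start <= end."""

    def __init__(self, intervals):
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = ClipIntervalTree(left) if left else None
        self.right = ClipIntervalTree(right) if right else None

    def query(self, t):
        """Return every interval that contains time t."""
        if t == self.center:
            return list(self.by_start)
        hits = []
        if t < self.center:
            # Intervals stored here end at or after center, so only start matters.
            for iv in self.by_start:
                if iv[0] > t:
                    break
                hits.append(iv)
            child = self.left
        else:
            # Intervals stored here start at or before center, so only end matters.
            for iv in self.by_end:
                if iv[1] < t:
                    break
                hits.append(iv)
            child = self.right
        if child is not None:
            hits.extend(child.query(t))
        return hits


tree = ClipIntervalTree([(0, 10, "a"), (5, 15, "b"), (20, 30, "c")])
print(sorted(iv[2] for iv in tree.query(7)))   # ['a', 'b']
print(tree.query(17))                          # []
```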

So now let's allow to play those clips somewhere else.


OK, the screenshot doesn't show much cool stuff, but the concept is simple: I can now have a track as a node with a custom time pin input, so I keep the ease of design (via the editor) but can also play a track (possibly in different places with different times).

I have 2 versions of track playback: one with a time input, and one with a signal that asks to start playing (which I use with the kinect gesture recognizer, for example).

That's a pretty handy new feature, but let's not stop there, let's go... forward (things are never good enough ;)

One common issue I have: most of the data in my tool is stored on the GPU (particle system, generated geometry...)

Many times I need some form of control over it (let's say, for example, computing particle size from age). I often use quick formulas for that, but then I thought: this timeline editor is so easy to use, why not use a track for this form of control instead of my ghetto formulas?

Track data fits really well into a 1d texture, so here we go: create a node that renders the whole track into a texture (with controllable precision), and the sampler will smooth it out a little if really needed.
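Here is a rough CPU-side sketch of what rendering a track into a 1d texture amounts to, written in Python. It assumes linear interpolation between keyframes, times normalized to [0, 1] and strictly increasing, and one sample per texel center. The function name is made up, and `width` plays the role of the controllable precision.

```python
def bake_track(keyframes, width):
    """Sample a track into `width` texels using linear interpolation.

    `keyframes` is a list of (time, value) pairs sorted by time, with
    times normalized to [0, 1]. Texel i is sampled at its center, and
    values are held constant before the first and after the last key.
    """
    texels = []
    for i in range(width):
        t = (i + 0.5) / width
        if t <= keyframes[0][0]:
            texels.append(keyframes[0][1])
            continue
        if t >= keyframes[-1][0]:
            texels.append(keyframes[-1][1])
            continue
        for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
            if t0 <= t <= t1:
                texels.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                break
    return texels


print(bake_track([(0.0, 0.0), (1.0, 1.0)], 4))  # [0.125, 0.375, 0.625, 0.875]
```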


Here we go: create a new particle behaviour that takes a 1d texture as control, and we have half a million particles with age/size controlled via timeliner. Life is beautiful at times ;)

As a side note, playback control using 1d texture/sampler is also so seamless that it becomes embarrassingly easy :
* One off playback : clamp sampler
* Loop : wrap
* Ping pong : mirror
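The three sampler modes map a time coordinate u back into the [0, 1] range of the track. Here is a small Python model of that mapping (on the GPU the sampler state does this for free); the function names are just labels for the example.

```python
def clamp(u):
    # One off playback: hold the first/last value outside [0, 1].
    return min(max(u, 0.0), 1.0)


def wrap(u):
    # Loop: keep only the fractional part.
    return u % 1.0


def mirror(u):
    # Ping pong: play forward on even periods, backward on odd ones.
    m = u % 2.0
    return m if m <= 1.0 else 2.0 - m


for u in (0.25, 1.25, 2.5):
    print(u, clamp(u), wrap(u), mirror(u))
```

For u = 1.25 this gives 1.0 with clamp, 0.25 with wrap and 0.75 with mirror.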

Next stage (going forward again): multi render into a Texture1DArray (so I can use several tracks for diversity).

As a side note my node collection also did a pretty hefty climb, but let's keep this for later posts :)