Friday, 20 February 2015

Profiling Direct2D (and hybrid UI rendering)

For a while I've really wanted to know what my user interface rendering actually costs.
I consider my UI pretty smooth, but smoother is better than smooth ;)

To debug Direct2D you can use the Visual Studio Graphics Debugger, but since I have several windows/panels, it's quite hard to get a snapshot.

Also I noticed that sometimes a snapshot doesn't include all the elements.

Technically, I know that Direct2D uses a DirectX11 device (with feature level 10), so if I could get hold of that device pointer, I could easily run queries between the BeginDraw/EndDraw calls.

But it looks like it's not possible....

However...

It is now possible to provide our own DirectX11 device to a Direct2D context. I looked at that feature a while ago and thought it would be really complicated to switch to, but it was so simple that I feel embarrassed I didn't do it before ;)

So the reasoning is simple: instead of creating an HwndRenderTarget, we create a SwapChain using our own DirectX11 device.

Then we can create a Direct2D render context from it:

Code Snippet
var context2d = new DeviceContext(this.swapChain.Texture.QueryInterface<SharpDX.DXGI.Surface>());
//Call Release on the texture, since QueryInterface does an AddRef
Marshal.Release(this.swapChain.Texture.NativePointer);


Yes, this is that easy.

The context is created from the SwapChain, so it will use the device which owns it, instead of creating a new one.

Also, since this context implements the Direct2D RenderTarget interface, no code changes were needed in the rest of the UI rendering. Life is good at times ;)

There's only one difference now: calling EndDraw does not trigger a Present call on the swapchain (technically I could use a standard render target instead of a SwapChain), so you have to call Present on your swapchain manually (not a big deal really ;)


Now that I have the device context working that way, I can just create a pipeline statistics query and a timestamp query.

I'm obviously most interested in primitive count and render time.

1/Primitive cost

So first let's have a look at each geometry type I use and check its cost at the IA stage.
Since I draw a border, I check the cost for the "Fill[Primitive]" call and for the "Draw[Primitive]" call which builds the outline.

One I use a lot is RoundedRectangle (with a 1 pixel corner).

FillRoundedRectangle (1 pix corner): 96
DrawRoundedRectangle: 264

So a single rounded rectangle is more than 350 primitives per element; that hurts (a lot).

Let's replace those with a standard Rectangle (as a side note, since I have grid snap, the visual difference was actually what I could call "none").

FillRectangle : 6

This was pretty expected, 2 triangles

DrawRectangle : 190

This was much less expected, and feels rather high.

Still, replacing RoundedRectangle with Rectangle pretty much halves your poly count (and of course it gave a pretty big boost to my rendering).

Now let's look at Ellipse (which I use for pins/keyframes in the timeliner):

FillEllipse : 204
DrawEllipse : 204

This is quite a staggering cost. I often have more than 1000 keyframes in my timeliner nowadays (OK, I cull keyframe rendering already), so my worst case scenario (fully unzoomed UI and a panel with everything visible) is a whopping 408,000 primitives!

In the case of a patch, I can replace the pins' Ellipse with a Rectangle as well (I've still learned to prefer the ellipse, which is visually much more pleasing, but then I can also add a render mode parameter to choose low/high quality).

In the case of the timeline, I can't really use a quad, so I need a solution (see below).

Next we have Lines/Beziers

Line (whatever the orientation/size...): 46
Bezier : from 312 -> 620
Dashed Bezier : same as above

So links are also rather expensive; a switch to choose the link style is definitely a nice thing to add.

And obviously, finally : 

Text : 6 

I'll come back to this later, but text is pretty much 6 primitives (with, most likely, a sprite sheet texture bound as well).

2/Draw ordering and Buffers

This is something I actually looked at more while using the graphics debugger, but it also gives you very valuable information.

For any solid color brush, Direct2D fills the geometry content into a buffer, and when it either gets a context change (see below) or the buffer is full, it copies the buffer and issues a draw.

So it does a pretty good job of limiting draw calls, but it does not make much use of GPU instancing, except for text rendering.

So let's take a standard node draw routine and see what's bad in there (pseudo code):

for each node
    fill rectangle
    draw outline
    draw title (text)
    draw pins (another loop)
end for

This translates into Direct2D this way (let's say we have 2 nodes, and I'll remove pins for clarity):

context->Draw(pcount, offset); //First node: rectangle + outline
context->DrawInstanced(6,1,0,0,0); //First node: text
context->Draw(pcount, offset); //Second node: rectangle + outline
context->DrawInstanced(6,1,0,0,0); //Second node: text

So each node needs 2 draw calls.
The obvious issue is that text rendering requires a different set of shaders, so Direct2D has to swap them and can't batch efficiently anymore.

So let's reorganize our drawing this way:

for each node
    fill rectangle
    draw outline
end for

for each node
    draw title
end for

So now we instead build 2 loops: first we render all the rectangles, then we render all the text.

And then magically :
context->DrawInstanced(6,n,0,0,0); //Node count

So now all our text is batched in a single draw, and the rectangle calls are also reduced (depending on node count, but pretty often down to a couple of calls maximum).


3/Hybrid rendering

So now, just by replacing some elements, I already managed to get quite a significant gain.

Here is the roundrect vs rect cost on a reasonably large patch (please note that CPU/GPU times are not additive, since they work in tandem):

RoundRect + Ellipse
CPU : 6.5ms
GPU : 4ms

Rectangles
CPU : 3ms
GPU : 0.5ms

This is a pretty huge boost (especially considering we keep the same quality).

Now, I mentioned that swapping Ellipse for Rectangle was not an option for the timeline, so I need a solution.

The obvious first choice is to use a small circle texture, but that does not play well with antialiasing (quite a big loss of quality).

So I need another solution....

And...

Do you remember? I am now drawing on a DirectX11 swapchain, and I have access to it.

So let's go hybrid.

Here is a 1000+ keyframe rendering profile (full redraw every frame, all keyframes visible):
CPU : 14ms
GPU : 9ms
Primitives : 388k

We can clearly see that this hurts our graphics card quite a lot: since Direct2D doesn't instance on the GPU side, we have 388,000 primitives uploaded to our GPU.

The rendering process is as follows (pseudo code again):

for each track
    render header 
    for each keyframe
        if keyframe in timeline time range
           calculate position
           draw keyframe
        end if
    end for
end for
render other bits (rulers...)

So now let's give DirectX11 a bit of work: we create a simple instancing shader (which rebuilds the size and gets the color from a small buffer).

We create 2 structured buffers (1 for screen space position, 1 for color index)

and change the rendering as follows (a SharpDX sketch of the hand-over comes after the pseudo code):

reset keyframe counter
for each track 
   render header
    for each keyframe
        if keyframe in timeline time range
            add position/colorid into the cpu array
            increment keyframe counter
        end if
    end for
end for   
if keyframe counter > 0
    end draw (give hand from d2d to d3d)
    copy position/colorid to buffers
    draw instanced circles (outline) -> only need to upload position buffer + single draw => huge win
    draw instanced circles (background) -> reuse the same buffer but scale down the circle in VS
    begin draw (we give back direct2d drawing rights)
end if
render other bits

Now, using instanced circles, here we are:
CPU  : 3ms
GPU : 2.4ms
Primitives : 51k

That's almost 5 times faster on the CPU workload, and almost 4 times faster on the GPU, not bad ;)
Also we divided our primitive count by more than 6, without any loss of quality!

Let's try some other techniques (note: we lose antialiasing in these cases).
Instanced Rectangles-> clip/discard in pixel shader
CPU  : 3ms
GPU : 2.0ms
Primitives : 13k

No initial geometry (build both Rectangles in GS, clip in PS)
CPU : 2.9ms
GPU : 1.7ms
Primitives : 11k

If you can accept losing AA, that can be another reasonable gain (thinking of lower end machines).

So being able to use DirectX11 alongside Direct2D is a pretty massive win :)

4/Next stage

Obviously, realizing how much gain we can get out of hybrid rendering, it would be a shame to stop here. Users love a smooth UI, so let's strive to give them one :)

Also, having access to the device makes some other features much easier (like drawing a texture inside the D2D viewport).

Also, as a side note: yes, the idea is to render the user interface every frame (no partial redraws for now).
Partial redraw may sound more efficient, but only once you've nailed the full draw (since a zoom/pan = full redraw, I don't want a half second drop when I perform that action ;)

New results/post soon

Saturday, 14 February 2015

Timeliner (again)

I haven't posted on my blog for quite a while; I've been rather busy with projects, and January was rather packed.

I used vvvv for one and FlareTic for the other.

At the end of each project, I could see how important a proper timeliner is and how much it changes your workflow (really, I mean it, and coming from someone who used to hate timelines, that must mean something...).

I already added the remaining basic features in December (copy/paste, undo/redo stack, organizing tracks/groups, zoom + pan, navigation, snap to rulers...), which is the MINIMUM expected.

Once the basics are done, it's time to start on fancy features (always do the basics first, then the fancy stuff, not the other way round...).

First is keyframe groups: you select a bunch of keyframes and can move them as a single unit (while still being able to move each keyframe individually).


You can see the grouped keyframes with a low alpha; Ctrl + click allows you to move the whole group instead of the track itself.

This is really handy when you need to quickly reorganize your keyframes after some requirement change.

Next you can notice purple/blue keyframes, what are those?

This is a simple thing called aliases: those keyframes reference a parent keyframe.

This is really useful when you copy a bunch of keyframes but need (almost) the same values. An alias holds an offset to the parent value, so changing the parent also changes the alias, while you can keep a bit of variation using the offsets.


So all of this really changed my way of working, but now let's think forward... non-linear playback.

First, I already have a playback optimizer (using a tree-based interval data structure).

So now let's allow to play those clips somewhere else.


OK, the screenshot doesn't show much cool stuff, but the concept is simple: I can now have a track as a node, with a custom time input pin, so I keep the ease of design (via the editor) but can also play a track (possibly in several places with different times).

I have 2 versions of track playback: one with a time input, and one with a signal that triggers playback (which I use with a Kinect gesture recognizer, for example).

That's a pretty handy new feature, but let's not stop there, let's go... forward (things are never good enough ;)

One common issue I have: most of the data in my tool is stored on the GPU (particle systems, generated geometry...).

Many times I need some form of control over it (say, for example, computing particle size from age). I often use quick formulas for that, but then I thought: this timeline editor is so easy to use, why not use a track for this kind of control instead of my ghetto formulas?

Track data fits really well into a 1D texture, so here we go: create a node that renders the whole track into a texture (with controllable precision); the sampler will smooth it out a little if really needed.


Then create a new particle behaviour that takes a 1D texture as control, and there we go: half a million particles with age/size controlled via the timeliner. Life is beautiful at times ;)

As a side note, playback control using the 1D texture/sampler is also so seamless that it becomes embarrassingly easy (see the sketch after this list):
* One off playback : clamp sampler
* Loop : wrap
* Ping pong : mirror

Next stage (going forward again): multi-render into a Texture1DArray (so several tracks can be used for diversity).

As a side note, my node collection has also had a pretty hefty climb, but let's keep that for later posts :)


Sunday, 23 November 2014

Shader Linker (Part 2)

Lately I've had some time again to work on FlareTic.

I added some nice new pixel material functions; here are a couple of screenshots:




Now, my procedural materials have a lot in common: they are mostly wave/noise functions which are then combined.

In the example above I use 4 noise functions (2 for roughness, 2 for reflectivity), which are then combined either additively or multiplicatively.

Now being able to combine those functions in an easier way would be rather handy. And hey, I already have a function linker patch to generate pixel shaders.

Even though I have the required base (a hybrid node/code linker), it still has some flaws.

First, you need a lot of swizzle operators. This adds a lot of pollution to the patch just to handle something like "set float to vector4".

But luckily, I already mentioned that I have implicit converters.

So I created a simple version of them, which adds an extra instruction to the linker so it can call passvaluewithswizzle instead of passvalue.



If we look at the screenshot above, instead of having an xxxx node to convert float to float4, we can see the converter just added itself implicitly, reducing patch node pollution.

Now, if we want to perform the conversion the opposite way (for example, float4 to float), we run into an issue: which component do we take?

But since I can add configuration to links (yes, in FlareTic links can also have parameters that you can modify via the inspector), this is suddenly trivial: I just add a swizzle parameter so we can choose which component we want.

In this case there's a little twist: when this parameter changes, I need to ask the linker to build me a new pixel shader, but that's rather trivial.


You can see in the screenshot above that the selected link has a swizzle parameter (those links are yellow in the patch, since they imply a loss of data, which makes it easier for the user to see that they can modify the link behaviour).


Next, there's the most serious issue for usability.
To be able to create a pixel shader, the graph must be complete; if an input pin does not have a connection, the link process will fail.

So I could easily provide a default, but let's think better: if an input pin has no connection, it would be much nicer to have its value in the inspector and be able to change it in real time.

So instead, I build a hidden RawBuffer and a few reader functions.

When I parse the graph, if a pin is connected I call passvalue; if not, I ask the hidden buffer for a data slot, and the hidden buffer also returns a reader node, which will grab the data from the buffer.

Before running the pixel shader, I grab the data from the inspector and copy it to the buffer, so it's easy to tweak input parameters when they are not connected.



You can see that the multiply node shows me a value editor for the non-connected pin. Modifying the value does not require relinking the shader; it's just copied into the buffer before the call.

Now, once all of this is done, I just needed to create an input/output template for my deferred materials and make sure I sort the calls properly.

And here we go: a hybrid code/patch material editor. I promise I'll do some nicer screenshots next time ;)




Now for next feature set (still work in progress):
  • Function grouping
  • More aggressive packing for buffer data.
  • Custom cbuffer integration



Tuesday, 28 October 2014

Hap Attack (Part 2)

In the previous post I explained a bit how to decode Hap files.

I explained a bit how the QuickTime format works, so now let's show a bit of code.

First, we need to access leaf atoms to extract information; for this, let's build a small interface:

Code Snippet
public interface ILeafAtomReader
{
    void Read(FileStream ds);
}
Now let's show an example implementation:

Code Snippet
public class ChunkOffsetReader : ILeafAtomReader
{
    private List<uint> chunkOffsetTable = new List<uint>();

    public List<uint> Table
    {
        get { return this.chunkOffsetTable; }
    }

    public void Read(FileStream ds)
    {
        //Bypass header (version + flags)
        ds.Seek(4, SeekOrigin.Current);

        uint entrycount = ds.ReadSize();

        for (uint i = 0; i < entrycount; i++)
        {
            //Each entry is a chunk offset within the file
            uint offset = ds.ReadSize();
            chunkOffsetTable.Add(offset);
        }
    }
}

This is reasonably simple: we just parse the data we require.

Now there is a little problem: some parsers will read the whole atom, while some might only read the data they want, so our file position pointer might not end up at the end of the atom.

To circumvent that, let's add a little adapter:

Code Snippet
public class LeafAtomReaderAdapter : ILeafAtomReader
{
    private readonly ILeafAtomReader reader;

    public LeafAtomReaderAdapter(ILeafAtomReader reader)
    {
        if (reader == null)
            throw new ArgumentNullException("reader");

        this.reader = reader;
    }

    public void Read(FileStream ds)
    {
        var currentpos = ds.Position;
        reader.Read(ds);
        ds.Seek(currentpos, SeekOrigin.Begin);
    }
}


This wraps another atom reader, but before letting it read, it stores the file position pointer and restores it once the other reader is done.

Since atom order is not guaranteed, we also need to specify which container atoms we are interested in:

Code Snippet
private string[] containers = new string[]
{
    "moov","trak","mdia","minf","stbl"
};


Then, once we find the right media sample table (the one which contains Hap), we need to look up a bit of extra information, so we store the moov and trak atom offsets (so we can later read tkhd to get the video size info, and mvhd to get the time units).

Code Snippet
if (containers.Contains(fcc.ToString()))
{
    if (fcc.ToString() == "trak")
    {
        this.currenttrakoffset = ds.Position;
    }
    if (fcc.ToString() == "moov")
    {
        this.currentmoovoffset = ds.Position;
    }

    //Keep parent position, since we'll want to get this to read sample table
    Parse(ds, ds.Position);
}


Once we've found a track with Hap, we can jump back to those stored file positions and read the headers.

So now we can finally play Hap files.
The only issue: without an SSD this is drive intensive, and we generally have a lot of memory, so let's allow loading the whole video data into RAM.

This is done differently for QT and AVI.

For QT I already built the lookup table, so I can just load a copy of the file into memory and look up from there:

Code Snippet
public unsafe static DataStream ReadFile(string path, CancellationToken token, IProgress<double> progress, int chunkSize = 1024)
{
    var fs = File.OpenRead(path);

    IntPtr dataPointer = Marshal.AllocHGlobal((int)fs.Length);
    IntPtr pointerOffset = dataPointer;

    byte[] chunk = new byte[chunkSize];
    int remaining = Convert.ToInt32(fs.Length - fs.Position);
    int read = 0;

    while (remaining > 0)
    {
        int toread = Math.Min(remaining, chunkSize);

        //Use the actual number of bytes read, Read is not guaranteed to fill the chunk
        int count = fs.Read(chunk, 0, toread);
        Marshal.Copy(chunk, 0, pointerOffset, count);

        pointerOffset += count;
        read += count;

        double p = (double)read / (double)fs.Length;
        progress.Report(p);

        remaining = Convert.ToInt32(fs.Length - fs.Position);

        if (token.IsCancellationRequested)
        {
            fs.Close();
            Marshal.FreeHGlobal(dataPointer);
            throw new OperationCanceledException();
        }
    }

    var ds = new DataStream(dataPointer, fs.Length, true, false);
    fs.Close();

    return ds;
}


This is just a simple file reader that grabs blocks and reports progress, so it can run as a background task.

For AVI I have no lookup table, but there's an API to get frameindex -> data (from disk). So I create a memory block large enough to contain the whole video (the file size works perfectly for that purpose ;)

Then, in the background, I request frames and build a prefix sum of offsets:

Code Snippet
public unsafe static AviOffsetTable BuildTable(hapFileVFW fileinfo, CancellationToken token, IProgress<double> progress)
{
    long fileLength = new FileInfo(fileinfo.Path).Length;
    int frameCount = fileinfo.FrameCount;

    IntPtr dataPointer = Marshal.AllocHGlobal((int)fileLength);
    IntPtr offsetPointer = dataPointer;

    List<OffsetTable> offsetTable = new List<OffsetTable>();

    int readBytes = 0;
    int currentOffset = 0;
    for (int i = 0; i < frameCount; i++)
    {
        fileinfo.WriteFrame(i, offsetPointer, out readBytes);

        OffsetTable t = new OffsetTable()
        {
            Length = readBytes,
            Offset = currentOffset
        };

        offsetTable.Add(t);

        offsetPointer += readBytes;
        currentOffset += readBytes;

        double prog = (double)i / (double)frameCount;
        progress.Report(prog);

        if (token.IsCancellationRequested)
        {
            Marshal.FreeHGlobal(dataPointer);
            throw new OperationCanceledException();
        }
    }
    progress.Report(1.0);
    return new AviOffsetTable(offsetTable, dataPointer);
}


This is simple too: we just ask the AVI wrapper to write into our pointer, get the number of bytes written, and move the pointer by that amount for the next frame. At the same time we build our offset table.

Once we have our data loaded in memory, everything is much simpler:

Code Snippet
public IntPtr ReadFrame(int frameIndex, IntPtr buffer)
{
    if (this.memoryLoader != null && this.memoryLoader.Complete)
    {
        var tbl = this.memoryLoader.DataStream;
        IntPtr dataPointer = tbl.DataPointer;
        var poslength = tbl.Table[frameIndex];
        dataPointer += (int)poslength.Offset;
        return dataPointer;
    }
    else
    {
        int readBytes = 0;
        int readSamples = 0;
        Avi.AVIStreamRead(this.VideoStream, frameIndex, 1, buffer, this.frameSize.Width * this.frameSize.Height * 6, ref readBytes, ref readSamples);
        return buffer;
    }
}

In the first case we just return a pointer from our lookup table (no memory copy required); in the second case we read from disk.

Preloading content into memory gives a huge performance gain (and memory is rather cheap; it's easy to have 64 gigs in a single machine, so preloading can definitely be a good option).

After that comes all the usual cleanup: managing the video element count and making sure we don't have memory leaks/crashes.

Now that I have a really nicely working player, why limit our imagination?

First, I wanted to test some 8K encoding, so I exported a few frames from 4v and tried to use VirtualDub to encode into Hap. Press Save -> Out of memory.

So instead, let's just encode directly from vvvv ;)

Writing the encoder was easy: you set the AVI headers with your video size/framerate/compression, then you only need to grab the texture from the GPU, convert it to whichever DXT/BC format you want, compress with Snappy if required, and write the frame.



One thing well done!

Next, since we run on DX11 hardware, we have access to new block compression formats:

  • BC6: three channels half floating point (hdr playback, mmmmmhhhh)
  • BC7: 4 channels, better quality than BC3/DXT5, but encoding is really slow
So let's add a few more FourCC codes, and add the option to the encoder/decoder:


Now we have new Hap formats:

Code Snippet
public enum hapFormat
{
    RGB_DXT1_None = 0xAB,
    RGB_DXT1_Snappy = 0xBB,
    RGBA_DXT5_None = 0xAE,
    RGBA_DXT5_Snappy = 0xBE,
    YCoCg_DXT5_None = 0xAF,
    YCoCg_DXT5_Snappy = 0xBF,
    RGB_BC6S_None = 0xA3,
    RGB_BC6S_Snappy = 0xB3,
    RGB_BC6U_None = 0xA4,
    RGB_BC6U_Snappy = 0xB4,
    RGBA_BC7_None = 0xA7,
    RGBA_BC7_Snappy = 0xB7,
}

Please note that those formats are also available out of the box in OpenGL (BPTC compression).
So any software that uses GL 3.1+ can take advantage of it (and really, software should already have moved to a GL4+ core profile, so there are NO excuses ;)


Finally, people always tend to think of videos as just a sequence of images.

There are some cases, though, where other layouts are more suitable (panoramic/dome projection).

In those cases, cubemaps are a much better fit.

Oh and DXT/BC formats support cubemap compression.

So let's just write the cubemap data as a frame, which was 0 lines of code in my case, since my writer already supports cubemap export.

Then there's only a little twist: in the AVI stream info, don't forget to multiply the required data size by 6 (yes, we now have 6 faces in one frame).

In AVISTREAMINFO:

public Int32 dwSuggestedBufferSize;

is the field where we set the suggested buffer size.

Then decoding frames works exactly like for standard textures (cubemaps are Texture2D too, so loading is done exactly the same way).

There's of course a little twist: in the case of a cube texture we need to set different parameters when creating the ShaderResourceView:

Code Snippet
ShaderResourceView videoView;
if (videoTexture.Description.OptionFlags.HasFlag(ResourceOptionFlags.TextureCube))
{
    //Cube texture: describe the view explicitly as a TextureCube
    ShaderResourceViewDescription srvd = new ShaderResourceViewDescription()
    {
        ArraySize = 6,
        FirstArraySlice = 0,
        Dimension = ShaderResourceViewDimension.TextureCube,
        Format = videoTexture.Description.Format,
        MipLevels = videoTexture.Description.MipLevels,
        MostDetailedMip = 0,
    };

    videoView = new ShaderResourceView(device.Device, videoTexture, srvd);
}
else
{
    //Standard 2D texture: the default view is fine
    videoView = new ShaderResourceView(device.Device, videoTexture);
}

That's more or less it, cube texture encoding/playback with Hap:




Some days well spent!!