Sunday 23 November 2014

Shader Linker (Part 2)

Lately I had some time again to work on FlareTic.

I added some nice new pixel material functions; here are a couple of screenshots:




Now, a lot of my procedural materials have much in common: they are mostly wave/noise functions which are then combined.

In the example above I use 4 noise functions (2 for roughness, 2 for reflectivity), which are then combined either additively or multiplicatively.

Being able to combine those functions in an easier way would be rather handy. And hey, I already have a function linker patch to generate pixel shaders.

Even though I have the required base (a hybrid node/code linker), it still has some flaws.

First, you need a lot of swizzle operators. This adds a lot of pollution in the patch just to express something like "convert float to float4".

But luckily, I already mentioned that I have implicit converters.

So I created a simple version of them, which adds an extra instruction to the linker so it can call passvaluewithswizzle instead of passvalue.



If we look at the screenshot above, instead of needing an xxxx swizzle node to convert float to float4, the converter just added itself implicitly, reducing patch node pollution.

Now if we want to perform the conversion the opposite way (for example, float4 to float), we run into an issue: which component should we take?

But since I can add configuration to links (yes, in FlareTic links can also have parameters that you can modify via the inspector), this is suddenly trivial: I just add a swizzle parameter so we can choose which component we want.

There is one little twist in that case: when this parameter changes, I need to ask the linker to build me a new pixel shader, but that's rather trivial.
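As a purely illustrative sketch (this is not FlareTic's actual code), the idea boils down to deducing a swizzle string from the two pin types, so the linker knows whether to emit passvalue or passvaluewithswizzle:

Code Snippet
//Hypothetical helper: deduce the swizzle used by passvaluewithswizzle.
//fromComponents/toComponents are the vector widths of the two pins.
static string GetSwizzle(int fromComponents, int toComponents, char selected = 'x')
{
    if (fromComponents == toComponents) return string.Empty;               //no conversion, plain passvalue
    if (fromComponents == 1) return "." + new string('x', toComponents);   //float -> floatN : replicate
    return "." + selected;   //narrowing (float4 -> float) : component chosen on the link
}

For float to float4 this yields ".xxxx", matching the manual swizzle node it replaces.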


You can see in the screenshot above that the selected link has a swizzle parameter (those links are yellow in the patch, since they imply a loss of data, which makes it easier for the user to see that they can modify the link behaviour).


Next comes the most serious issue for usability.
To be able to create a pixel shader, the graph must be complete: if an input pin has no connection, the link process will fail.

I could easily provide a default, but let's think further: if an input pin has no connection, it would be much nicer to have the value in the inspector and be able to change it in real time.

So instead, I build a hidden RawBuffer and a few reader functions.

When I parse the graph, if a pin is connected I call passvalue; if not, I ask the hidden buffer for a data slot, and the hidden buffer also returns a reader node, which will grab the data from the buffer.

Before running the pixel shader, I grab the data from the inspector and copy it into the buffer, so it's easy to tweak input parameters when they are not connected.
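The mechanism is roughly this (hypothetical types below, not the actual FlareTic classes; I assume float4 slots for simplicity):

Code Snippet
//Each unconnected pin gets a slot; inspector values are written to the slot,
//and the whole array is copied into the hidden RawBuffer before the shader runs.
class HiddenValueBuffer
{
    private readonly List<Vector4> values = new List<Vector4>();

    public int Acquire(Vector4 initialValue)   //slot index used by the generated reader function
    {
        values.Add(initialValue);
        return values.Count - 1;
    }

    public void Set(int slot, Vector4 value)   //called when an inspector value changes
    {
        values[slot] = value;
    }

    public Vector4[] ToArray()                 //copied into the GPU buffer before the draw call
    {
        return values.ToArray();
    }
}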



You can see that the multiply node shows a value editor for the non-connected pin. Modifying the value does not require relinking the shader; it's just copied into the buffer before the call.

Once all of this was done, I just needed to create an Input/Output template for my deferred materials, and make sure I sort the calls properly.

And here we go: a hybrid code/patch material editor. I promise I'll do some nicer screenshots next time ;)




Now for next feature set (still work in progress):
  • Function grouping
  • More aggressive packing for buffer data.
  • Custom cbuffer integration



Tuesday 28 October 2014

Hap Attack (Part 2)

In the previous post I explained a bit how to decode Hap files.

I also described how the QuickTime format works, so now let's show some code.

First we need to access leaf nodes to extract information; for this, let's build a small interface:

Code Snippet
  1. public interface ILeafAtomReader
  2. {
  3.     void Read(FileStream ds);
  4. }

Now let's show an example implementation:

Code Snippet
  1. public class ChunkOffsetReader : ILeafAtomReader
  2. {
  3.     private List<uint> chunkOffsetTable = new List<uint>();
  4.  
  5.     public List<uint> Table
  6.     {
  7.         get { return this.chunkOffsetTable; }
  8.     }
  9.  
  10.     public void Read(FileStream ds)
  11.     {
  12.         //Bypass header
  13.         ds.Seek(4, SeekOrigin.Current);
  14.  
  15.         uint entrycount = ds.ReadSize();
  16.  
  17.         for (uint i = 0; i < entrycount; i++)
  18.         {
  19.             uint size = ds.ReadSize();
  20.             chunkOffsetTable.Add(size);
  21.         }
  22.     }
  23. }

This is reasonably simple: we just parse the data we require.
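ReadSize is not shown above; it is just a small extension method. As a rough sketch (not the actual implementation, assuming QuickTime's big-endian 32-bit integers), it could look like this:

Code Snippet
public static class StreamExtensions
{
    //Reads a 32 bit big-endian unsigned integer from the stream
    public static uint ReadSize(this Stream s)
    {
        byte[] b = new byte[4];
        s.Read(b, 0, 4);
        return (uint)((b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]);
    }
}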

Now there is a little problem: some parsers will read the whole atom, while others might only read the data they want, so our file position pointer might not end up at the end of the atom.

To circumvent that, let's add a little adapter:

Code Snippet
  1. public class LeafAtomReaderAdapter : ILeafAtomReader
  2. {
  3.     private readonly ILeafAtomReader reader;
  4.  
  5.     public LeafAtomReaderAdapter(ILeafAtomReader reader)
  6.     {
  7.         if (reader == null)
  8.             throw new ArgumentNullException("reader");
  9.  
  10.         this.reader = reader;
  11.     }
  12.  
  13.     public void Read(FileStream ds)
  14.     {
  15.         var currentpos = ds.Position;
  16.         reader.Read(ds);
  17.         ds.Seek(currentpos, SeekOrigin.Begin);
  18.     }
  19. }


This wraps another atom reader, but before letting it read, it stores the file position pointer, and restores it once the other reader is done.

Since atom order is not guaranteed, we also need to specify which containers we are interested in:

Code Snippet
  1. private string[] containers = new string[]
  2. {
  3.     "moov","trak","mdia","minf","stbl"
  4. };


Then, once we find the right media sample table (the one which contains Hap), we need to look up a bit of extra information, so we store the moov and trak atom offsets (so we can later read tkhd to get video size info, and mvhd to get the time units).

Code Snippet
  1. if (containers.Contains(fcc.ToString()))
  2. {
  3.     if (fcc.ToString() == "trak")
  4.     {
  5.         this.currenttrakoffset = ds.Position;
  6.     }
  7.     if (fcc.ToString() == "moov")
  8.     {
  9.         this.currentmoovoffset = ds.Position;
  10.     }
  11.  
  12.     //Keep parent position, since we'll want to get this to read sample table
  13.     Parse(ds, ds.Position);
  14. }


Once we have found a track with Hap, we can jump back to those stored file positions and read the headers.

So now we can finally play Hap files.
The only issue: without an SSD this is drive intensive, and we generally have a lot of memory, so let's allow loading the whole video data into RAM.

This is done differently for QT and AVI.

For QT I already built the lookup table, so I can just load a copy of the file into memory and look up from there:

Code Snippet
  1. public unsafe static DataStream ReadFile(string path, CancellationToken token, IProgress<double> progress, int chunkSize = 1024)
  2. {
  3.     var fs = File.OpenRead(path);
  4.  
  5.     IntPtr dataPointer = Marshal.AllocHGlobal((int)fs.Length);
  6.     IntPtr pointerOffset = dataPointer;
  7.  
  8.     byte[] chunk = new byte[chunkSize];
  9.     int remaining = Convert.ToInt32(fs.Length - fs.Position);
  10.     int read = 0;
  11.  
  12.     while (remaining > 0)
  13.     {
  14.         int toread = Math.Min(remaining, chunkSize);
  15.  
  16.         fs.Read(chunk, 0, toread);
  17.         Marshal.Copy(chunk, 0, pointerOffset, toread);
  18.  
  19.         pointerOffset += toread;
  20.         read += toread;
  21.  
  22.         double p = (double)read / (double)fs.Length;
  23.         progress.Report(p);
  24.  
  25.         remaining = Convert.ToInt32(fs.Length - fs.Position);
  26.  
  27.         if (token.IsCancellationRequested)
  28.         {
  29.             fs.Close();
  30.             Marshal.FreeHGlobal(dataPointer);
  31.             throw new OperationCanceledException();
  32.         }
  33.     }
  34.  
  35.     var ds = new DataStream(dataPointer, fs.Length, true, false);
  36.     fs.Close();
  37.  
  38.     return ds;
  39. }


This is just a simple file reader that grabs blocks and reports progress, so it can run as a background task.

For AVI I have no lookup table, but an API to get frameindex -> data (from disk). So I create a memory block large enough to contain the whole video (the file size works perfectly for that purpose ;)

Then, in the background, I request frames and build a prefix sum:

Code Snippet
  1. public unsafe static AviOffsetTable BuildTable(hapFileVFW fileinfo, CancellationToken token, IProgress<double> progress)
  2. {
  3.     long fileLength = new FileInfo(fileinfo.Path).Length;
  4.     int frameCount = fileinfo.FrameCount;
  5.  
  6.     IntPtr dataPointer = Marshal.AllocHGlobal((int)fileLength);
  7.     IntPtr offsetPointer = dataPointer;
  8.  
  9.     List<OffsetTable> offsetTable = new List<OffsetTable>();
  10.  
  11.     int readBytes = 0;
  12.     int currentOffset = 0;
  13.     for (int i = 0; i < frameCount; i++)
  14.     {
  15.         fileinfo.WriteFrame(i, offsetPointer, out readBytes);
  16.  
  17.         OffsetTable t = new OffsetTable()
  18.         {
  19.             Length = readBytes,
  20.             Offset = currentOffset
  21.         };
  22.  
  23.         offsetTable.Add(t);
  24.  
  25.         offsetPointer += readBytes;
  26.         currentOffset += readBytes;
  27.  
  28.         double prog = (double)i / (double)frameCount;
  29.         progress.Report(prog);
  30.  
  31.         if (token.IsCancellationRequested)
  32.         {
  33.             Marshal.FreeHGlobal(dataPointer);
  34.             throw new OperationCanceledException();
  35.         }
  36.     }
  37.     progress.Report(1.0);
  38.     return new AviOffsetTable(offsetTable, dataPointer);
  39. }


This is simple too: we ask the AVI wrapper to write into our pointer, get the number of bytes written, and move the pointer by that offset for the next frame. At the same time we build our offset table.

Once we have our data loaded in memory everything is much simpler:

Code Snippet
  1. public IntPtr ReadFrame(int frameIndex, IntPtr buffer)
  2. {
  3.     if (this.memoryLoader != null && this.memoryLoader.Complete)
  4.     {
  5.         var tbl = this.memoryLoader.DataStream;
  6.         IntPtr dataPointer = tbl.DataPointer;
  7.         var poslength = tbl.Table[frameIndex];
  8.         dataPointer += (int)poslength.Offset;
  9.         return dataPointer;
  10.     }
  11.     else
  12.     {
  13.         int readBytes = 0;
  14.         int readSamples = 0;
  15.         Avi.AVIStreamRead(this.VideoStream, frameIndex, 1, buffer, this.frameSize.Width * this.frameSize.Height*6, ref readBytes, ref readSamples);
  16.         return buffer;
  17.     }
  18. }

In the first case we just return a pointer from our lookup table (no memory copy required); in the second case we read from disk.

Preloading content into memory gives a huge performance gain (and memory is rather cheap; it's easy to have 64 GB in a single machine, so preloading can definitely be a good option).

After that comes all the usual cleanup: manage the video element count and make sure we don't have memory leaks / crashes.

Now that I have a really nicely working player, why limit our imagination?

First, I wanted to test some 8k encoding, so I exported a few frames from 4v and tried to use VirtualDub to encode to Hap. Press Save -> Out of memory.

So instead, let's just encode directly from vvvv ;)

Writing the encoder was easy: you set the AVI headers with your video size/framerate/compression, then you only need to get the texture from the GPU, convert it to whichever DXT/BC format you want, compress with Snappy if required, and write the frame.
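Here is a rough sketch of the per-frame encode step (ReadBackFrame, CompressToBC and WriteAviFrame are hypothetical helpers, and I assume the Snappy wrapper exposes a managed Compress overload):

Code Snippet
void EncodeFrame(int frameIndex, hapFormat format)
{
    byte[] pixels = ReadBackFrame(frameIndex);   //staging readback of the GPU texture
    byte[] dxt = CompressToBC(pixels, format);   //block compress to the chosen DXT/BC format
    byte[] payload = SnappyCodec.Compress(dxt);  //optional, only for the *_Snappy variants

    //Hap frame = 4 byte header (3 length bytes, assumed little-endian + 1 format flag byte) + payload
    byte[] frame = new byte[payload.Length + 4];
    frame[0] = (byte)(payload.Length & 0xFF);
    frame[1] = (byte)((payload.Length >> 8) & 0xFF);
    frame[2] = (byte)((payload.Length >> 16) & 0xFF);
    frame[3] = (byte)format;
    Buffer.BlockCopy(payload, 0, frame, 4, payload.Length);

    WriteAviFrame(frameIndex, frame);            //e.g. via AVIStreamWrite
}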



One thing well done!

Next, since we run on DX11 hardware, we have access to the new block compression formats:

  • BC6: three channels, half floating point (HDR playback, mmmmmhhhh)
  • BC7: 4 channels, better quality than BC3/DXT5, but encoding is really slow
So let's add a few more FourCCs, and add the option to the encoder/decoder:


Now we have new Hap formats:

Code Snippet
  1. public enum hapFormat
  2. {
  3.     RGB_DXT1_None = 0xAB,
  4.     RGB_DXT1_Snappy = 0xBB,
  5.     RGBA_DXT5_None = 0xAE,
  6.     RGBA_DXT5_Snappy = 0xBE,
  7.     YCoCg_DXT5_None = 0xAF,
  8.     YCoCg_DXT5_Snappy = 0xBF,
  9.     RGB_BC6S_None = 0xA3,
  10.     RGB_BC6S_Snappy = 0xB3,
  11.     RGB_BC6U_None = 0xA4,
  12.     RGB_BC6U_Snappy = 0xB4,
  13.     RGBA_BC7_None = 0xA7,
  14.     RGBA_BC7_Snappy = 0xB7,
  15. }

Please note that those formats are also available out of the box in OpenGL (BPTC compression).
So any software that uses GL 3.1+ can take advantage of them (and really, software should already have moved to a GL4+ core profile, so there are NO excuses ;)


Finally, people always tend to think of videos as just a sequence of flat images.

There are some cases, though, where other layouts are more suitable (panoramic/dome projection, for example).

In those cases cubemaps are a much better fit.

Oh and DXT/BC formats support cubemap compression.

So let's just write the cubemap data as a frame, which was 0 lines of code in my case, since my writer already supports cubemap export.

Then there's only a little twist: in the AVI stream info, don't forget to multiply the required data size by 6 (yes, we now have 6 textures in one frame).

In AVISTREAMINFO:
public Int32    dwSuggestedBufferSize;

is the field where we indicate the required buffer size.

Then decoding frames works exactly like standard textures (cubemaps are Texture2Ds too, so loading is done in exactly the same way).
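For reference, the cube variant of the texture description could look like this (a sketch based on the Texture2DDescription from the previous post; the 6 faces are just 6 array slices):

Code Snippet
Texture2DDescription cubeDesc = new Texture2DDescription()
{
    ArraySize = 6,                                  //one slice per cube face
    BindFlags = BindFlags.ShaderResource,
    CpuAccessFlags = CpuAccessFlags.None,
    Format = format.GetTextureFormat(),
    Height = this.frameSize.Height,
    Width = this.frameSize.Width,
    MipLevels = 1,
    OptionFlags = ResourceOptionFlags.TextureCube,  //marks the array as a cube map
    SampleDescription = new SharpDX.DXGI.SampleDescription(1, 0),
    Usage = ResourceUsage.Immutable
};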

There's of course a little twist: in the case of a cube texture we need to set different parameters when creating the ShaderResourceView:

Code Snippet
if (!videoTexture.Description.OptionFlags.HasFlag(ResourceOptionFlags.TextureCube))
{
    //Regular 2D texture : the default view is enough
    ShaderResourceView videoView = new ShaderResourceView(device.Device, videoTexture);
}
else
{
    //Cube texture : explicitly request a TextureCube view over the 6 slices
    ShaderResourceViewDescription srvd = new ShaderResourceViewDescription()
    {
        ArraySize = 6,
        FirstArraySlice = 0,
        Dimension = ShaderResourceViewDimension.TextureCube,
        Format = videoTexture.Description.Format,
        MipLevels = videoTexture.Description.MipLevels,
        MostDetailedMip = 0,
    };

    ShaderResourceView videoView = new ShaderResourceView(device.Device, videoTexture, srvd);
}

That's more or less it, cube texture encoding/playback with Hap:




Some days well spent!!



Saturday 25 October 2014

Hap attack (and Quicktime fun)

A little while ago I got asked to add Hap support in vvvv.

This is a rather simple format: the idea is that you get a BC1/BC3 frame (with light Snappy compression), so you can do very fast GPU uploads.

It's more or less the scheme used by many "media servers"; one difference is that everything is packed in a single file instead of a bunch of dds files.

It's a pretty useful format since frame load is very fast, and can even be done within a single render frame, so you can have perfect synchronisation between videos on a single machine (or across multiple machines).

So the first step is to simply decode a frame. As a test rig I just used a Media Foundation source reader, which happily gives me a sample (aka a frame) in compressed form.

Once you have this, everything is reasonably straightforward:

The first 4 bytes are a header: [length] (3 bytes) + [flag] (1 byte).

The flag gives you the compression + format, like this (C#):

Code Snippet
  1. public enum hapFormat
  2. {
  3.     RGB_DXT1_None = 0xAB,
  4.     RGB_DXT1_Snappy = 0xBB,
  5.     RGBA_DXT5_None = 0xAE,
  6.     RGBA_DXT5_Snappy = 0xBE,
  7.     YCoCg_DXT5_None = 0xAF,
  8.     YCoCg_DXT5_Snappy = 0xBF
  9. }
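To make that concrete, here is a minimal sketch of reading the header (assuming data is a byte array holding the raw sample, and assuming the three length bytes are little-endian):

Code Snippet
uint frameLength = (uint)(data[0] | (data[1] << 8) | (data[2] << 16)); //3 byte length
hapFormat format = (hapFormat)data[3];                                 //1 byte flag
bool snappy = ((byte)format & 0xF0) == 0xB0;                           //0xB* values are Snappy compressed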

Once you have this, you need to call Snappy to decompress (if relevant):

Code Snippet
  1. int uncomp = 0;
  2. SnappyStatus st = SnappyCodec.GetUncompressedLength(bptrData, frameLength, ref uncomp);
  3. st = SnappyCodec.Uncompress(bptrData, frameLength, (byte*)snappyTempData, ref uncomp);
  4. initialData = snappyTempData;


I just used an existing P/Invoke wrapper, no need to waste time reinventing the wheel:

http://snappy4net.codeplex.com/

Now that you have your frame ready, you just have to upload it to your GPU:

Code Snippet
  1. Texture2DDescription textureDesc = new Texture2DDescription()
  2. {
  3.     ArraySize = 1,
  4.     BindFlags = BindFlags.ShaderResource,
  5.     CpuAccessFlags = CpuAccessFlags.None,
  6.     Format = format.GetTextureFormat(),
  7.     Height = this.frameSize.Height,
  8.     Width = this.frameSize.Width,
  9.     MipLevels = 1,
  10.     OptionFlags = ResourceOptionFlags.None,
  11.     SampleDescription = new SharpDX.DXGI.SampleDescription(1, 0),
  12.     Usage = ResourceUsage.Immutable
  13. };
  14.  
  15. DataRectangle dataRectangle = new DataRectangle(initialData, format.GetPitch(this.frameSize.Width));
  16. Texture2D videoTexture = new Texture2D(this.device, textureDesc, dataRectangle);
  17. ShaderResourceView videoView = new ShaderResourceView(this.device, videoTexture);

format.GetTextureFormat() takes care of properly converting the hap format to the relevant BC texture format.
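Those two extension methods are not shown here; a hypothetical version (not the actual implementation) could look like this:

Code Snippet
public static class HapFormatExtensions
{
    public static SharpDX.DXGI.Format GetTextureFormat(this hapFormat format)
    {
        switch (format)
        {
            case hapFormat.RGB_DXT1_None:
            case hapFormat.RGB_DXT1_Snappy:
                return SharpDX.DXGI.Format.BC1_UNorm;
            default:
                return SharpDX.DXGI.Format.BC3_UNorm; //DXT5 and YCoCg DXT5
        }
    }

    public static int GetPitch(this hapFormat format, int width)
    {
        //BC textures are stored as 4x4 blocks : 8 bytes per block for BC1, 16 for BC3
        int blockSize = format.GetTextureFormat() == SharpDX.DXGI.Format.BC1_UNorm ? 8 : 16;
        return Math.Max(1, (width + 3) / 4) * blockSize;
    }
}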

That was about it to decode a frame, hardcore work ;)

Now, as usual, this part is only the tip of the iceberg; you have to think about how to handle playback.

My initial thought was to continue using the Media Foundation source reader, so I wrote a little player and checked decode time: around 1.5ms per full HD frame (on a laptop with no SSD).

So all is pretty promising, really fast decode access, but then you reach the point where you want to loop your video (which involves calling SetCurrentPosition on your source reader).

Surprisingly, this is extremely slow: grabbing a frame after a seek suddenly takes 60ms (which is far too much, obviously). That completely rules out random play (seeking every frame) as well.

So back to the good old Windows AVI API, which just parses the file and allows you to load a random frame into memory.

First we load the file:

Code Snippet
Avi.AVIFileInit();
fileHandle = 0;
int r = Avi.AVIFileOpen(ref fileHandle, @"E:\repositories\cartfile\other\EncodingTest\sample-1080p30-Hap.avi", Avi.OF_READWRITE, 0);

Avi.AVIFileGetStream(fileHandle, out videoStream, Avi.streamtypeVIDEO, 0);

Avi.AVISTREAMINFO streamInfo = new Avi.AVISTREAMINFO();
Avi.AVIStreamInfo(videoStream, ref streamInfo, Marshal.SizeOf(streamInfo));

Avi.BITMAPINFO bi = new Avi.BITMAPINFO();
int biSize = Marshal.SizeOf(bi);
Avi.AVIStreamReadFormat(videoStream, 0, ref bi, ref biSize);

SharpDX.Multimedia.FourCC fcc = new SharpDX.Multimedia.FourCC(bi.bmiHeader.biCompression);

Please note that we get the Hap FourCC in the bitmap compression header (so of course we add a check to verify that our AVI is encoded using Hap).

Now to get a frame, we simply call:

Code Snippet
Avi.AVIStreamRead(this.videoStream, frameIndex, 1, this.aviTempData, 1920 * 1080, 0, 0);


with our frame index. Since Hap uses one keyframe per frame, this is extremely fast.
Once done, we upload to the GPU as before.

Now a bit of code to integrate into vvvv: just wrap all that lot into some plugins:



That's pretty much it. Please note that the upload is so fast that I haven't yet bothered to do any buffering. I quite like the concept of "ask for this frame and get it" :)

Now, one thing is that Hap can come in 2 containers: AVI (from the DirectShow codec) or MOV (from the QuickTime codec).

Of course most Hap files from people using this thing called a Mac will be encoded with the second one. It's actually easy to change the container, but it would be much better to read QuickTime files directly.

That causes an initial problem: you need QuickTime installed on Windows (which sucks), and that implies using the QuickTime SDK for Windows (which was abandoned more than 5 years ago). That also means forgetting about 64 bit support.

As a side note, I can limit my use case: I only want to read Hap, so if a video uses another codec my player will simply not accept it.

So let's see if we can't just parse that MOV file and extract the raw data like we do with AVI.

The only difference here: I did not find any wrapper (like vfw.h for AVI), so time to go read specifications and open a hexadecimal editor ;)

For people interested, I will leave you to read the whole specs :
https://developer.apple.com/library/mac/documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html#//apple_ref/doc/uid/TP40000939-CH202-TPXREF101

But let's summarize.

QuickTime files use the concept of an Atom (which is more or less just a node in a tree structure).
Each Atom has a length and a code (FourCC), and can either contain other atoms or data. It is structured this way:

[Length 4 Bytes][FourCC 4 Bytes][Data = Length - 8 Bytes]

Yes, the length parameter includes itself and the FourCC.
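As a small sketch (assuming ds is an open FileStream), reading one atom header looks like this; note that QuickTime stores integers big-endian:

Code Snippet
byte[] header = new byte[8];
ds.Read(header, 0, 8);
uint atomSize = (uint)((header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3]);
string fourcc = System.Text.Encoding.ASCII.GetString(header, 4, 4);
long remaining = atomSize - 8; //data (or child atoms) left in this atom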

Please note there is nothing in the file format to tell automatically whether an Atom is a leaf (data) or a container (contains other Atoms), so you have to go read the documentation and find out for yourself.

The first atom is the File Type Compatibility atom ("ftyp"); it contains a header to check that it's a valid QuickTime file, plus a few version fields.

Next we have "wide", which is a special one to allow to add a flag for large files.

Then we have "mdat" , which contains all the sample data (where we want to read from). But of course for now we don't know how data is organized.

So we need to go to the next one (called "moov"), which contains all the information we need. There is a really large number of options, but roughly, from there we retrieve frames per second and the track list ("trak" atoms).

We can already go into the track header ("tkhd" atom) to retrieve the track length / size.

Our work is not finished yet: we need to check that our file is Hap, and this is contained in the "stsd" atom (Sample Description).
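As a rough sketch, the check itself is small (ds is assumed to be positioned right after the "stsd" atom header; the Hap sample descriptions use the four-character codes "Hap1", "Hap5" and "HapY"):

Code Snippet
ds.Seek(8, SeekOrigin.Current);   //skip version/flags + entry count
ds.Seek(4, SeekOrigin.Current);   //skip the first sample description size
byte[] fmt = new byte[4];
ds.Read(fmt, 0, 4);
string codec = System.Text.Encoding.ASCII.GetString(fmt);
bool isHap = codec == "Hap1" || codec == "Hap5" || codec == "HapY";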

Once we are in the sample table, the most important data is within reach: how to find the position and length of a frame.

First, data is organized in chunks. A chunk contains one or more samples (so, for example, you can load a whole chunk from file instead of one sample at a time).

So we need to enumerate the file offset of each chunk, which is contained in the "stco" Atom (Chunk Offset Atom).

The data is simply a table containing the file offset of each chunk. Please note the offsets are absolute within the file, which makes things much easier: once we have that data we don't need to walk child Atoms anymore.

Here is the data for a test MOV file (powered by a hex editor and the Windows calculator ;)

stco (chunkoffsets)
Chunk 1 Offset : 48
Chunk 2 Offset : 1015901
Chunk 3 Offset : 2030918
Chunk 4 Offset : 3045373
Chunk 5 Offset : 4058366
Chunk 6 Offset : 4348429

Pretty simple; and since everything is absolute, the search is also much faster.

Now, we need to know each Sample (or frame) size.

All frame sizes are contained in a single Atom ("stsz"), so we go through them and get each frame length:

stsz (sample size table)
Sample 1 : 144974
Sample 2 : 145333
Sample 3 : 145210
Sample 4 : 145344
Sample 5 : 145065
Sample 6 : 144811

Now we still don't know how samples relate to chunks (the last missing piece of the puzzle).

Now we need to read the data in the "stsc" atom (Sample To Chunk), which is a compacted (run-length) table.

In my sample mov, this is described this way:

stsc (sample to chunk)
First chunk / Samples per chunk / Descriptor
1 / 7 / 1
5 / 2 / 1
6 / 1 / 1

As you can see this is compressed (7+2+1 = 10), and my file has 31 frames.

So this simply expands: from chunk 1 up to chunk 5 (the first 4 chunks), each chunk has 7 samples; chunk 5 has 2 and chunk 6 has 1.

Which is correct, since 7*4+2+1 = 31


So that's about it: with all that data we are ready to roll. We first build a running sum of samples per chunk (the index of the first sample of each chunk):
0 - 7 - 14 - 21 - 28 - 30

From this it's pretty easy to find which chunk contains our frame.

Once we know the chunk and the position within the chunk, we iterate over the sample lengths until we reach our position.

So the third frame's location is:

Chunk 1 offset + sample 1 size + sample 2 size = 48 + 144974 + 145333

Of course we can precompute all this (per chunk or per sample), so we end up with two arrays (see the sketch below):
frameindex->location
frameindex->size
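As a sketch (hypothetical helper, assuming the stco/stsz/stsc tables were parsed as described above), building those two arrays looks like this:

Code Snippet
//sampleToChunk entries : (first chunk, samples per chunk), chunk indices are 1-based
static void BuildFrameTables(IList<uint> chunkOffsets, IList<uint> sampleSizes,
    IList<Tuple<int, int>> sampleToChunk, out long[] frameOffset, out uint[] frameSize)
{
    int frameCount = sampleSizes.Count;
    frameOffset = new long[frameCount];
    frameSize = new uint[frameCount];

    int sample = 0;
    for (int chunk = 0; chunk < chunkOffsets.Count && sample < frameCount; chunk++)
    {
        //samples in this chunk = value of the last stsc entry starting at or before it
        int samplesInChunk = 0;
        foreach (var entry in sampleToChunk)
            if (entry.Item1 - 1 <= chunk) samplesInChunk = entry.Item2;

        long offset = chunkOffsets[chunk];
        for (int i = 0; i < samplesInChunk && sample < frameCount; i++, sample++)
        {
            frameOffset[sample] = offset;          //absolute file position of this frame
            frameSize[sample] = sampleSizes[sample];
            offset += sampleSizes[sample];
        }
    }
}

For the third frame (index 2) this gives 48 + 144974 + 145333, matching the manual computation above.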

Then we just load our frame from file into memory, and rinse and repeat the Hap upload.

So here we go: a Hap decoder (a small SharpDX standalone sample) which reads Hap MOV files without any need for QuickTime to be installed.





Fun times