Catflier: Parallel Reduction basics

Very often in computer graphics we need some form of parallel reduction.

For example, I stumbled on a vvvv blog post about boids simulation. In that case there is two main things you need for efficient simulation:

Average of all positions.
Acceleration structure for neighbours

Since actually both techniques use similar concepts, but the second one is slightly more advanced, I'll explain the first one.

So here is the deal:

We have let's say 4096 (64*64) boids, and we need average for some rules.

Our boids are stored in a StructuredBuffer (so we can Read/Write into it).

If we do this in CPU, we have something like this (pseudo code)

float3 sum;

foreach(boid in boids)

{

sum += boid.pos;

}

sum /= (float)boidscount;

Of course, problem is our data is in GPU, so we don't want to transfer back and forth.

What is actually fun in DirectX11 is you have many ways to cover the same technique, let go trough a few of them.

1/Additive Render

Create a 1x1 texture (R32G32B32A32_Float)

Render all your boids, vertex shader is like this:

Code Snippet

struct vsInput
{
    uint iv : SV_VertexID;
};
 
struct psInput
{
    float4 screenpos : SV_POSITION;
    float4 objectpos : TEXCOORD0;
};
 
psInput VS(vsInput input)
{
    psInput output;
    output.screenpos = float4(0,0,0,1);
    output.objectpos = float4(PositionBuffer[input.iv],0.0f);
    return output;
}

Here we just set position to be 0,0 (since anyway we have single pixel), Boid position is sent to pixel shader as Texture Coordinate.

No here is our (hardcore) pixel shader:

Code Snippet

float4 PS(psInput input): SV_Target
{
    return input.objectpos;
}

We only return object position (please make sure to set Blend to Additive).

Now our single pixel contains the sum of all positions.

Divide this by boids count (in another pixel shader or a single compute shader), and you have your average position.

2/Good old MipMap

For this one, this is also pretty simple. Create a texture in R32G32B32A32_Float format as well, big enough to fit all boids (in our case 64*64 fits perfectly).

Now render your boids as PointList (so each boids position must arrive in a single output pixel).

So either draw 4096 elements, position boids from SV_VertexID to match a pixel (basic 1D->2D conversion).

Alternatively you can make an instanced Draw Call (like 64 instances of 64 vertices each), So you will have SV_VertexID and SV_InstanceID as vertex shader input (each one representing row/column index)

Write boid position in the pixel as above (no additive this time, just a plain write).

Now just call GenerateMips on your texture, and your average position is in last mipmap.

So now since we use DirectX11 and have access to compute shaders, we might as well use them.

So here we are, compute shader way.

3/Dummy Loop

Easiest technique to compute average is a simple dummy loop.

You send a 1,1,1 dispatch, and here is the compute shader code:

Code Snippet

[numthreads(1,1,1)]
void CS_Average(uint3 dtid : SV_DispatchThreadID)
{
    float3 sum = 0;
    uint cnt, s;
    PositionBuffer.GetDimensions(cnt,s);
    for (uint i = 0; i < cnt; i++)
    {
        sum += PositionBuffer[i];
    }
    
    RWAverageBuffer[0] = sum * invcnt;
    
}

This (excluding simplicity), is an example on how NOT to use compute shaders.

You use a single thread which does all the job, so you have plenty of threads doing nothing and one doing big job, which is more like a serial algorithm.

But this will be useful at a later stage. Doing the whole calculation like that is a NO GO tho.

4/Iterative

Now let's improve that a bit. we want to use [numthreads(64,1,1)]

We have 4096 elements (eg: 64*64).

So what we'll do is the following:
Create a buffer with 64 elements.

Each of our threads will work on a part of the sum.

Code Snippet

StructuredBuffer<float3> PositionBuffer;
 
RWStructuredBuffer<float3> RWAverageBuffer : BACKBUFFER;
 
[numthreads(64,1,1)]
void CS_Average(uint3 dtid : SV_DispatchThreadID)
{
    float3 sum = 0;
    for (uint i = 0; i  < 64; i++)
    {
        uint idx = i * 64 + dtid.x;
        sum += PositionBuffer[idx];
    }
    
    RWAverageBuffer[dtid.x] = sum;
}
 
[numthreads(64,1,1)]
void CS_AverageTransposed(uint3 dtid : SV_DispatchThreadID)
{
    float3 sum = 0;
    for (uint i = 0; i  < 64; i++)
    {
        uint idx = dtid.x * 64 + i;
        sum += PositionBuffer[idx];
    }
    
    RWAverageBuffer[dtid.x] = sum;
}

So now we have a 64 elements buffer containing a part of that sum. But we made efficient use of your threads (please note that I show two read patterns , "row or column" based).

Now what we can simply do it run our dummy loop above, but it will only run on 64 elements.

So by decomposing like this and adding an extra dispatch, we allowed the first part to run MUCH faster, and leaving a very little job to do for our unefficient shader, which gives a (very) significant gain.

Let's see another way

5/Using groupshared and Thread Sync

Now let's look if we can do this in Compute and a Single pass. Since our element count is not that high it should quite easily allow it.

We've seen that we store temporary sum into a structured buffer. Now compute shader have a small memory space where they can share data (it's big enough to fit our 64 elements).

So here are declarations:

Code Snippet

float invcnt;
StructuredBuffer<float3> PositionBuffer;
 
 
RWStructuredBuffer<float3> RWAverageBuffer : BACKBUFFER;
 
//Used for local averages
groupshared float3 averages[64];

Here nothing that complicated, you can just notice the groupshared declaration to store temporary results.

Now here is Compute Shader code.

Code Snippet

[numthreads(64,1,1)]
void CS_Average(uint3 dtid : SV_DispatchThreadID)
{
    //Compute 64 averages in parallel
    float3 sum = 0;
    for (uint i = 0; i  < 64; i++)
    {
        uint idx = dtid.x * 64 + i;
        sum += PositionBuffer[idx];
    }
    
    averages[dtid.x] = sum;
    
    /*We need to wait for all threads to execute,
    so our groupshared is ready to use */
    GroupMemoryBarrierWithGroupSync();
    
    //Just make sure only one thread finish the job
    if (dtid.x > 0) { return ; }
    
    float3 endsum = 0;
    for (uint j = 0; j  < 64; j++)
    {
        endsum += averages[j];
    }
    
    RWAverageBuffer[0] = endsum * invcnt;
}

Here we do the same dispatch as above, but threads write into this temporary buffer instead of on the output.

Then to apply final pass, we need to wait for all threads to have finished to process their writes, which is what GroupMemoryBarrierWithGroupSync does.

It will block all threads until all have finished (and since in our case they do pretty much the same job, the stalling should be fairly minimal).

Now we keep only one thread for the rest of the execution (other ones have finished their job), which performs the last small loop, but reading from groupshared instead.

That's it, instead we've done in in a single pass!

Performance wise, it's more or less equivalent (except it saves us a second dispatch on CPU). In that scenario it saves us on dispatch and to create one buffer, which is always handy also.

Please note that something like 64*64 is some pretty ideal scenario which might not reflect real life, so this example will potentially require a bit more gymnastics.

Performance tuning with groupshared is fairly complex, so I advise to test iterative/group versions on your scenarios, with a very large number of element some efficient iterative algorithm might win (also this will be very architecture dependent).

Of course, we can notice that compute shader version would also work perfectly for min/max like operations (which can be useful for some other purposes).

That's it for now )

Catflier

Friday, 11 October 2013

Parallel Reduction basics

1/Additive Render

2/Good old MipMap

3/Dummy Loop

4/Iterative

5/Using groupshared and Thread Sync

1 comment: