Wednesday 2 October 2013

Graph API in FlareTic

Before a (long) post explaining the new feature set going into FlareTic (and it's a hefty one, believe me), I thought I would talk a little bit about its internals.

So let's start with graph evaluation.

Most node based software follows a single paradigm:
-Data driven (vvvv)
-Event driven (max)
-A bit of a mix in between (vuo)

That means your whole composition has to fit that model, which always felt unnatural and inefficient to me.

So in contrast, FlareTic provides different patch types, and each of them can use whatever paradigm it wants.
The node set also depends on the patch type, so some nodes are only allowed in certain patches and not in others.
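
Just to illustrate the idea (a minimal C++ sketch, not FlareTic's actual code; all the names here are made up), each patch type can simply own its evaluation strategy:

```cpp
// Hypothetical sketch: each patch type owns its own evaluation model.
#include <queue>
#include <vector>

struct INode
{
    virtual void Evaluate() = 0;
    virtual ~INode() = default;
};

struct IPatch
{
    virtual void Evaluate() = 0;   // each patch decides how (and if) it evaluates
    virtual ~IPatch() = default;
};

// Data driven: walk the nodes in dependency order every frame.
struct DataFlowPatch : IPatch
{
    std::vector<INode*> sortedNodes;   // topologically sorted whenever the graph changes
    void Evaluate() override
    {
        for (auto* node : sortedNodes) node->Evaluate();
    }
};

// Event driven: only nodes that received an event get to run.
struct EventPatch : IPatch
{
    std::queue<INode*> pending;        // filled by device callbacks / events
    void Evaluate() override
    {
        while (!pending.empty()) { pending.front()->Evaluate(); pending.pop(); }
    }
};
```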

That gives some advantages and drawbacks; let's start with the drawbacks:
  • It can be a little confusing, since whenever you create a new patch type you have to learn how it works.
  • Since each node set is on a per-patch basis, there is potential for a lot of node duplication.
Now the advantages:
  • Since each patch type is isolated with its own node set, it's very easy to optimize each type.
  • I can create patch types that do very different things (a state machine, for example).
  • To minimize node duplication, there is a connector system which allows data transfer between patches based on certain rules.
  • Having a limited node set per patch keeps those patches clean (no mixing of logic and rendering, for example).
  • I could create a patch type which does everything if I wanted to ;)
So here is what we have for the moment:
  • Root patch: a mix of event/data driven (event for devices, data for textures).
  • Automation patch: data driven, since in that case it's the easiest model.
  • SceneGraph: no real model; everything is dealt with on connections, and nodes define their own rules. Links are purely interface based, and a node decides whether it wants a connection or not. This puts a lot of work on the nodes, but allows an extremely simple pipeline (for example, the particle system node can accept any emitter/collider/behavior and organizes the whole lot internally, so we don't need any fancy grouping on the patching side; see the sketch after this list). It also ensures code quality: if a new behavior is created, it fits into the system directly.
  • State Machine patch: no model either, since the state machine is rebuilt whenever you add states or create links. After that, the state machine is passed around and there is no more graph evaluation.
  • Shader patch: same as above; patch actions notify the patch to rebuild the shader, which is then passed around as a single object.
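
To make the SceneGraph idea a bit more concrete, here is a rough sketch of what interface-based links could look like (all interface and class names are made up for illustration; this is not FlareTic's actual code):

```cpp
// Sketch of interface-based links: the particle system node itself decides
// whether it accepts a connection, and sorts whatever it receives internally.
#include <vector>

struct ISceneNode { virtual ~ISceneNode() = default; };
struct IEmitter  : ISceneNode { virtual void Emit(float dt)  = 0; };
struct ICollider : ISceneNode { virtual void Collide()       = 0; };
struct IBehavior : ISceneNode { virtual void Apply(float dt) = 0; };

struct ParticleSystemNode
{
    std::vector<IEmitter*>  emitters;
    std::vector<ICollider*> colliders;
    std::vector<IBehavior*> behaviors;

    // A link is accepted only if the source implements an interface this node understands.
    bool AcceptConnection(ISceneNode* source)
    {
        if (auto* e = dynamic_cast<IEmitter*>(source))  { emitters.push_back(e);  return true; }
        if (auto* c = dynamic_cast<ICollider*>(source)) { colliders.push_back(c); return true; }
        if (auto* b = dynamic_cast<IBehavior*>(source)) { behaviors.push_back(b); return true; }
        return false;                               // link refused by the node
    }

    void Update(float dt)
    {
        for (auto* e : emitters)  e->Emit(dt);      // no grouping nodes needed on the patch side
        for (auto* b : behaviors) b->Apply(dt);
        for (auto* c : colliders) c->Collide();
    }
};
```
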
One big advantage of this is that I can create a new patch type (an OpenCV patch is in progress, which runs in the background and auto syncs), define its rules, and get a new optimized version for it. Not allowing every node to be thrown together also ensures some code quality, but more on this in the next post too; there are a few other parts in progress ;)

Also, since I decided to follow the rule "fewer nodes / more workload per node", this fits pretty well in the pipeline: I hardly need to optimize evaluation, and the node set ensures I have quality components. New nodes need to fit into the pipeline, so it doesn't get polluted by lots of random nodes.

On evaluation, there's no real fanciness either: no parallel graph evaluation, no hardcore graph branch elimination (that's only dealt with by the pins, which tell whether they want data or not), and if a patch does not run, it does not run. That's the easiest thing ever, but so efficient in 99% of cases ;)
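
As a rough illustration (again a made-up sketch, not the real implementation), that "pins ask for data, a patch that does not run does not run" logic is about as simple as it sounds:

```cpp
// Hypothetical sketch: pins tell whether they want data, disabled patches are skipped.
#include <vector>

struct Pin { bool wantsData = true; };

struct Node
{
    std::vector<Pin*> outputs;
    void Evaluate()
    {
        bool anyoneListening = false;
        for (auto* pin : outputs) anyoneListening |= pin->wantsData;
        if (!anyoneListening) return;   // nothing downstream wants data: skip the work
        /* ... actual node work ... */
    }
};

struct Patch
{
    bool enabled = true;
    std::vector<Node*> nodes;
    void Evaluate()
    {
        if (!enabled) return;           // a patch that does not run... does not run
        for (auto* node : nodes) node->Evaluate();
    }
};
```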

So why no parallel evaluation? Well, I don't buy the idea that parallel is faster (not always, at least, and it can also be much slower). I think it's just some marketing crap for people who have no clue about programming internals.

So let's explain a bit.

Let's take a simple example (the one presented in one of those random vuo demos).

You have this graph


So as we can see, A provides data to B and C, which have no linear dependency between them; then both of them provide data to D, which needs a barrier (i.e. make sure B and C have been evaluated) before it can execute.

So we could easily say: "Oh, if we run B and C in separate threads we'd have a speed gain, let's do that!!"

NO!!!!

Let's explain: thread synchronization has a cost, and here we have 4 operations which are so minimal (one Add, one Multiply, one Divide, plus a little bit of graph parsing) that you'd have processed all that data serially before you've even dispatched the threads and the barrier.
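
To put a rough number on that (a minimal, naive sketch, not a rigorous benchmark), compare doing the three trivial operations serially with dispatching B and C via std::async and waiting on both:

```cpp
// Minimal sketch: spawning and joining threads costs far more than three scalar ops.
#include <chrono>
#include <cstdio>
#include <future>

int main()
{
    using clock = std::chrono::high_resolution_clock;
    volatile float a = 2.0f, result = 0.0f;

    auto t0 = clock::now();
    float b = a + 1.0f;                 // B: add
    float c = a * 3.0f;                 // C: multiply
    result  = b / c;                    // D: divide
    auto serialTime = clock::now() - t0;

    auto t1 = clock::now();
    auto fb = std::async(std::launch::async, [&] { return a + 1.0f; });  // B on its own thread
    auto fc = std::async(std::launch::async, [&] { return a * 3.0f; });  // C on its own thread
    result  = fb.get() / fc.get();      // D: the "barrier", waits for both
    auto threadedTime = clock::now() - t1;

    std::printf("serial:   %lld ns\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(serialTime).count());
    std::printf("threaded: %lld ns\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(threadedTime).count());
    // For work this small, the threaded version is typically orders of magnitude slower.
}
```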

Now let's say C has a huge workload (I added a large set of random values to its second pin, as in the picture):


Now in that case, running B and C in parallel gives little or no gain, since you have one node (B) doing more or less no work and one node (C) which is pretty busy.

In that case the real gain would be to have C spawn multiple threads itself, since a multiply is easy to parallelize (4 threads processing 100000/4 elements each). Running B and C concurrently is totally useless.
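
That per-node data parallelism is easy to sketch (again just an illustration, assuming 100000 elements and 4 threads as in the example above):

```cpp
// Sketch: data-parallel multiply inside node C, 4 threads processing 100000/4 elements each.
#include <cstddef>
#include <thread>
#include <vector>

void MultiplyParallel(const std::vector<float>& in, std::vector<float>& out,
                      float factor, unsigned threadCount = 4)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = in.size() / threadCount;

    for (unsigned t = 0; t < threadCount; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == threadCount - 1) ? in.size() : begin + chunk;
        workers.emplace_back([&in, &out, factor, begin, end]
        {
            for (std::size_t i = begin; i < end; ++i) out[i] = in[i] * factor;
        });
    }
    for (auto& w : workers) w.join();   // C's own barrier; B's tiny add can just stay serial
}

int main()
{
    std::vector<float> values(100000, 2.0f), result(values.size());
    MultiplyParallel(values, result, 3.0f);
}
```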

The only good use case for such a scheme would be:


In that case B also has a decent workload, so running both concurrently would give some performance improvement. But...

Now let's look a bit closer again:
B still has much less workload than C, so D will have to wait for C anyway.
If we now push 10 times more data to C, D will still wait for C.

So the bottleneck stays C.

Since both operators are easy to parallelize, running B and C data parallel would still give a much higher performance boost.

Now we could also ask: why not use both techniques (i.e. run task parallel and data parallel)?

Well, you'll run into issues. It can be done, but it's pretty complex: now you have B and C fighting for processing time, so it's not as easy as it looks.

Now, just to add a little bit, let's add 2 more nodes, E and F.


Now, which path is your scheduler going to choose?
3 threads for B, C, E, barrier on D, barrier on F?
2 threads, with B, C, D in serial, E on its own, barrier on F?

Since each node can also have a different workload (which can also vary over time), you get a combinatorial explosion on the way.

And here I only have 6 nodes (I don't count the iobox ;). Think of a composition with 1000 nodes (which is generally a small composition if you have a low level node set).

Also add the fact that the composition needs to submit data to the GPU for rendering (since we don't want to do hello world all the time ;), that the GPU has parallel command lists of its own, and that you also have (like in my last project) several threads for network sync, camera/video decoding and texture streaming. In that case, throwing more cores at your graph might also pollute those and make your whole application run slower.

I'm not even mentioning that if you need to render the data above, you can replace that patch with a compute shader and blow the performance away anyway (by a pretty huge factor).

So basically, multicore graph processing is hard; don't advertise it unless you can show some proper rock solid features, not a hello world with 4 nodes which doesn't even benefit from it ;)

The best solution (though not very "designer friendly") is to give users the choice: let them decide, since they know which parts of their compositions are intensive, and select that manually.

So here we are; next time I'll go a bit deeper into the scene graph and the shader system (before the big fat post).

Stay tuned ;)






2 comments:

  1. Interesting idea to combine different graph evaluation modes. Can you elaborate more on their differences and where each type is appropriate? Would be great!

    1. I'll happily do that in a few posts (reworking some parts of my tool to go even further in that direction at the moment, plus a few other aspects). A couple of posts in between though ;)
