Wednesday 20 November 2013

Execution Path and IO

I explained in previous post how you can build simple functions and convert them into IL code.

Now what is interesting is you can easily optimize those by using simple assumptions, which are things that you would easily forget while writing code.

You can consider 2 things:
  • Immediate optimizations : For elements that you know at compile time
  • Code Path : You need to define them at runtime depending on inputs size.
Let's see a few Immediate optimizations:

Code Snippet
  1. [SlicewiseMethod(Name = "SimpleLerp", Category = "Value")]
  2. public static void SimpleLerp(
  3.     [Input("Input 1")] double d1,
  4.     [Input("Input 2")] double d2,
  5.     [Input("Amount")] double amt,
  6.     [Output("Output")] out double result)
  7. {
  8.     result = VMath.Lerp(d1, d2, amt);
  9. }

Here we just process a simple lerp.

Since in slicewise operations, our output size is always SpreadMax, Generator will automaticaly select a pointer output,and miss the mod operator.

This is a no brainer, easy as that, outputs never use mod and do direct storage.

Second test:

Code Snippet
  1. [SlicewiseMethod(Name = "Sin", Category = "Value")]
  2. public static void Sin(
  3.     [Input("Input")] double d,
  4.     [Output("Output")] out double dout)
  5. {
  6.     dout = Math.Sin(d);
  7. }

Here we have one input and one output. The compiler can detect that, and automatically set the input read to safely ignore the mod operator too.

Now let's look at our lerp above, and let's say that we want only one lerp value for all elements:

Code Snippet
  1. [SlicewiseMethod(Name = "SimpleLerp2", Category = "Value")]
  2. public static void SimpleLerp(
  3.     [Input("Input 1")] double d1,
  4.     [Input("Input 2")] double d2,
  5.     [Input("Amount", IsSingle=true)] double amt,
  6.     [Output("Output")] out double result)
  7. {
  8.     result = VMath.Lerp(d1, d2, amt);
  9. }

Here we did set the IsSingle Attribute to our variable. Compiler will detect that, read the variable into a local once before the for loop.

Now let's say we want to multiply a spread by a constant value:

Code Snippet
  1. [SlicewiseMethod(Name = "MultiplyFixed", Category = "Value")]
  2. public static void SimpleLerp(
  3.     [Input("Input")] double d,
  4.     [Input("Amount", IsSingle = true)] double amt,
  5.     [Output("Output")] out double result)
  6. {
  7.     result = d * amt;
  8. }

Here as before, our Amount variable will be stored into a local.
But now we also know that our Input Variable is alone (same as the Sine case).
So compiler will remove the mod operator on input value as well!

Now let's see a little bit more complex case:

Code Snippet
  1. [Selector(Name = "Function", Category = "Value Inverse")]
  2. public static Dictionary<string, Func<double, double>> TrigonometryInv
  3. {
  4.     get
  5.     {
  6.         var result = new Dictionary<string, Func<double, double>>();
  7.         result.Add("Asin", (x) => Math.Asin(x));
  8.         result.Add("Acos", (x) => Math.Acos(x));
  9.         result.Add("Atan", (x) => Math.Atan(x));
  10.         return result;
  11.     }
  12. }

Here we basically build a node that Has one input, one output , and and enum to select function.

Equivalent code:

Code Snippet
  1. [PluginInfo(Name = "Function", Category = "Value", Version = "Simple")]
  2. public unsafe class DummyFunc : IPluginEvaluate
  3. {
  4.     public enum eFunc { Sin,Cos, Tan}
  5.     public delegate double FuncDelegate (double input);
  6.  
  7.     [Input("Input")]
  8.     private ValueInput input;
  9.     [Input("Function", IsSingle = true)]
  10.     private ISpread<eFunc> function;
  11.     [Output("Output")]
  12.     private ValueOutput output;
  13.     private Dictionary<eFunc, FuncDelegate> functable = new Dictionary<eFunc, FuncDelegate>();
  14.  
  15.     public DummyFunc()
  16.     {
  17.         functable.Add(eFunc.Sin, (x) => Math.Sin(x));
  18.         functable.Add(eFunc.Cos, (x) => Math.Cos(x));
  19.         functable.Add(eFunc.Tan, (x) => Math.Tan(x));
  20.     }
  21.  
  22.     public void Evaluate(int SpreadMax)
  23.     {
  24.         FuncDelegate f = functable[function[0]];
  25.         output.Length = SpreadMax;
  26.         double* iptr = input.Data;
  27.         double* optr = output.Data;
  28.         for (int i = 0; i < SpreadMax; i++)
  29.         {
  30.             optr[i] = f(iptr[i]);
  31.         }
  32.     }
  33. }

Please note that we could create a loop delegate, like this:

Code Snippet
  1. [PluginInfo(Name = "Function", Category = "Value", Version = "Loopable")]
  2. public unsafe class DummyFunc2 : IPluginEvaluate
  3. {
  4.     public enum eFunc { Sin, Cos, Tan }
  5.     public delegate void FuncLoopDelegate(int cnt, double* input, double* output);
  6.     [Input("Input")]
  7.     private ValueInput input;
  8.  
  9.     [Input("Function", IsSingle = true)]
  10.     private ISpread<eFunc> function;
  11.  
  12.     [Output("Output")]
  13.     private ValueOutput output;
  14.  
  15.     private Dictionary<eFunc, FuncLoopDelegate> functable = new Dictionary<eFunc, FuncLoopDelegate>();
  16.     public DummyFunc2()
  17.     {
  18.         functable.Add(eFunc.Sin, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Sin(input[i]); } });
  19.         functable.Add(eFunc.Cos, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Cos(input[i]); } });
  20.         functable.Add(eFunc.Tan, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Tan(input[i]); } });
  21.     }
  22.  
  23.     public void Evaluate(int SpreadMax)
  24.     {
  25.         FuncLoopDelegate f = functable[function[0]];
  26.         output.Length = SpreadMax;
  27.         double* iptr = input.Data;
  28.         double* optr = output.Data;
  29.         f(SpreadMax,iptr,optr);
  30.     }
  31. }

This doesn't give a very good gain tho, so first option is easier to build.


Now let's look at code path.

We can eventually decide of our loop at runtime, let's take our Vector2 join again:

We have 5 cases we want to consider:
  • Length (x) == Length(y) : We can ignore mod on each vector.
  • Length (x) == 1 : We store x before the loop and sample y at full speed
  • Length (y) == 1 : Same as above, just the other way round.
  • Length (x) > Length(y) : Sample x at full speed, mod on y
  • Length (y) < Length(y) : Opposite
Expressed as code:

Code Snippet
  1. public delegate void VectorLoop(int cnt, double* d1, int i1, double* d2,int i2, Vector2D* output);
  2.  
  3. private Dictionary<eFuncSelector, VectorLoop> functable = new Dictionary<eFuncSelector, VectorLoop>();
  4.  
  5. public DummyVectorDynamic()
  6. {
  7.     functable.Add(eFuncSelector.XMax, (cnt, x, i1,y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i % i2]; } });
  8.     functable.Add(eFuncSelector.YMax, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i%i1].y = y[i]; } });
  9.     functable.Add(eFuncSelector.FullSpeed, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i]; } });
  10.     functable.Add(eFuncSelector.SingleX, (cnt, x, i1, y, i2, o) => { double xval = x[0]; for (int i = 0; i < cnt; i++) { o[i].x = xval; o[i].y = y[i]; } });
  11.     functable.Add(eFuncSelector.SingleY, (cnt, x, i1, y, i2, o) => { double yval = y[0]; for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = yval; } });
  12. }

Now in our evaluate method:

Code Snippet
  1. public void Evaluate(int SpreadMax)
  2. {
  3.     dout.Length = SpreadMax * 2;
  4.     Vector2D* d = (Vector2D*)dout.Data;
  5.  
  6.     double* pt1 = d1.Data;
  7.     double* pt2 = d2.Data;
  8.     int l1 = d1.Length;
  9.     int l2 = d2.Length;
  10.  
  11.     if (l1 == l2) {
  12.         functable[eFuncSelector.FullSpeed](SpreadMax, pt1, l1, pt2, l2,d);
  13.     } else if (l1 == 1) {
  14.         functable[eFuncSelector.SingleX](SpreadMax, pt1, l1, pt2, l2, d);
  15.     } else if (l2 == 1) {
  16.         functable[eFuncSelector.SingleY](SpreadMax, pt1, l1, pt2, l2, d);
  17.     } else if (l1 > l2) {
  18.         functable[eFuncSelector.XMax](SpreadMax, pt1, l1, pt2, l2, d);
  19.     } else {
  20.         functable[eFuncSelector.YMax](SpreadMax, pt1, l1, pt2, l2, d);
  21.     }
  22. }

Now here are a few benchmarks (considering different cases):

X = 50000, Y = 20
Native: 0.35 ms
Zip : 0.28ms
Mutable : 0.2 ms

This is one of the 2 worst cases, (when X != Y), but we saved a mod on the biggest spread, which technically reflects.

X = 25000, Y = 25000
In that case the node will use : FullSpeed
Native: 0.17ms
Zip : 0.12ms
Mutable : 0.06 ms

In that case we simply ignore all the mod operators, and Mutable clearly wins.

X = 50000, Y = 1
Native : 0.36 ms
Zip : 0.25ms
Mutable : 0.16ms

We can also see that pushing a local and setting other at full speed gives a decent gain. So setting path at runtime is significant.

So technically, that gives us this information:

let function (p1,p2,p3,p4) = dosomething

Now what we have, to generalize our vector, is the following (for each parameter)
if length == 1 -> set local before loop and use that local
if length == SpreadMax -> read without mod
if length < SpreadMax -> read modded

Of course we can clearly see that we have a combinatorial explosion, a function with 8 parameters has a LOT of different cases.

Few options here:
generate the worst case scenario (eg: mod everything).

When you enter Evaluate:
build a mini hash (length == 1) -> 0, (length == SpreadMax) -> 1, (length < SpreadMax) -> 2

Ask the runtime if we already have the function, if not compile it and give it back.

Run the function.

Yes, clearly as you've seen, we have nodes that self compile ;)

So this is of course to use with caution, if your node input length vary a lot, you might have a lot of recompile.
If you have reasonably steady spread counts, you'll have a little overhead on startup. Please note that something like:

let function (a,b) => a+b
a = 2000, b = 1
next frame , a = 1000, b= 1

will not cause a recompile, since a is is still at SpreadMax.

On Some specific notes, node compiler is about 0.5 millisecond, which is pretty fast, but if you have 1000 nodes rebuilding on the fly, that's not too good ;)

What is good is that our node is now a basic container, so we save some il code. Second very cool thing, our logic becomes a delegate, which is much easier to wrap into a Task, but more on that later.

No comments:

Post a Comment