Monday 18 November 2013

Into Code to IL to Node

I changed the title a bit, but yes I'll kinda speak again of code generation somehow.

I remember some time ago devvvvs shown some prototype to have simpler ways to makes nodes. Basically you create a function and it generates boilerplate code around.
Sadly it never really made it through (maybe they keep it for their new version).

Basically you build a function and you generate the node to fit it.

I started to work on this, I called the project MicroNodes ;)

Here is a simple example

Code Snippet
  1. [SlicewiseMethod(Name = "+", Category = "Value")]
  2. public static void Add([Input("Input 1")] double d1, [Input("Input 2")] double d2, [Output("Output")] out double dout)
  3. {
  4.     dout = d1 + d2;
  5. }

Ok That's the infamous Add (without group), but you see the idea.

You basically have your operator, and then I build a plugin in intermediate language that prepares data for each slice, then calls the function.

So a dummy implementation looks like this:

Code Snippet
  1. [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Simple")]
  2. public class DummyAdd3 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ISpread<double> d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ISpread<double> d2;
  9.  
  10.     [Output("Output")]
  11.     private ISpread<double> dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.SliceCount = SpreadMax;
  16.         double dou;
  17.         for (int i = 0; i < SpreadMax; i++)
  18.         {
  19.             SliceWise.Add(d1[i], d2[i],out dou);
  20.             dout[i] = dou;
  21.         }
  22.     }
  23. }

You can notice I use output parameter in that case, that will fit some other parts. Here is the same implementation using operator:

Code Snippet
  1. [PluginInfo(Name = "Add", Category="Value",Version="Operator Simple")]
  2. public class DummyAdd : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ISpread<double> d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ISpread<double> d2;
  9.  
  10.     [Output("Output")]
  11.     private ISpread<double> dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.SliceCount = SpreadMax;
  16.         for (int i = 0; i < SpreadMax; i++)
  17.         {
  18.             dout[i] = d1[i] + d2[i];
  19.         }
  20.     }
  21. }

You can notice that it's very easy to generate intermediate language for Slicewise methods.

First interesting bit is that both of those are running at same speed. Reason being simple, the JIT, when it converts the Intermediate language into machine code will inline the function call, so assembly wise it's pretty similar.

That's good news, since it means I can easily use functions and call has no overhead.

Now let's see how we can optimize this.
First, any slicewise operation, output result count = spreadmax.
Doing dout[i] does the following in Intermediate Language:
virtcall ISpread<T>._setcyclic()

Setting a slice like this does (for every single write)

  • Mark the Spread as dirty (eg: it needs to flush data back into vvvv) 
  • Go access the internal DataStore (an array basically)
  • Do a Mod to store, output[i%icount] = data
That's actually a lot.

So now we can access the internal store ourselves, this is exposed on Spread Type.

Here is a better version:

Code Snippet
  1. [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Buffer")]
  2. public class DummyAdd4 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ISpread<double> d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ISpread<double> d2;
  9.  
  10.     [Output("Output")]
  11.     private ISpread<double> dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.SliceCount = SpreadMax;
  16.         double[] b = dout.Stream.Buffer;
  17.         for (int i = 0; i < SpreadMax; i++)
  18.         {
  19.             SliceWise.Add(d1[i], d2[i], out b[i]);
  20.         }
  21.         dout.Flush(true);
  22.     }
  23. }

Now, since we access the buffer (and we know we never have to use Mod for output), we get access to the array, and can pass the index as output directly,this does this in IL:

  IL_0039:  ldloc.0 (This is our array local variable)
  IL_003a:  ldloc.1  (This is index)
  IL_003b:  ldelema    [mscorlib]System.Double

So instead it loads the address of the array and pass it to the function.

Pleace note that I call Flush(true), since now I write directly into the internal store, the dirty flag if never set.

So it's easy to add this in the IL generator, and it Gives you a decent substantial boost.

For 5000 elements:
ISpread virtual : 0.1 ms
ISpread buffer : 0.06 ms

For 50000 elements
ISpread virtual : 1 ms
ISpread buffer : 0.6 ms

It's pretty low values but it can easily add up.

So now we can use pointer output, since we can write directly back into memory.
This doesn't give a decent improvement, since a memcpy on 5000 elements is blazing fast, But at least we save some memory (40 kilobytes for a 5000 spread).

It seems kinda minimal, but in a reasonably large patch, it also adds up quickly.

So now the obvious next part is to modify our input. We also do a virtcall for access which also does a mod.

Operator version:

Code Snippet
  1. [PluginInfo(Name = "Add", Category = "Value", Version = "Operator Ptr Full")]
  2. public unsafe class DummyAdd8 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ValueInput d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ValueInput d2;
  9.  
  10.     [Output("Output")]
  11.     private ValueOutput dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.Length = SpreadMax;
  16.         double* d = dout.Data;
  17.  
  18.         double* pt1 = d1.Data;
  19.         double* pt2 = d2.Data;
  20.  
  21.         int l1 = d1.Length;
  22.         int l2 = d2.Length;
  23.  
  24.         for (int i = 0; i < SpreadMax; i++)
  25.         {
  26.             d[i] = pt1[i%l1] + pt2[i%l2];
  27.         }
  28.     }
  29. }

And Static function version:

Code Snippet
  1. [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Ptr Full")]
  2. public unsafe class DummyAdd10 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ValueInput d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ValueInput d2;
  9.  
  10.     [Output("Output")]
  11.     private ValueOutput dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.Length = SpreadMax;
  16.         double* d = dout.Data;
  17.         double* pt1 = d1.Data;
  18.         double* pt2 = d2.Data;
  19.         int l1 = d1.Length;
  20.         int l2 = d2.Length;
  21.         for (int i = 0; i < SpreadMax; i++)
  22.         {
  23.             SliceWise.Add(pt1[i % l1], pt2[i % l2], out d[i]);
  24.         }
  25.     }
  26. }

As before Operator vs Static function gets pretty much 0 difference, which is still good news, since it means that we can still easily generate that IL code.

It yields a difference compared to ISpread still, here is the table:

For 5000 elements:
ISpread virtual : 0.1 ms
ISpread buffer : 0.06 ms
Ful Pointer : 0.03 ms

For 50000 elements
ISpread virtual : 1 ms
ISpread buffer : 0.6 ms
Ful Pointer : 0.3 ms

Please note that we also save memory for each pin, so let's say we have 5000 elements in first spread and 500 in second, we gain : (5000*8*2) + (500*8) = 84 kilobytes

That's quite pretty decent specially also considering that we have a node that is much faster.

So technically speaking, one very fun thing is that by allowing the user to build this first function as above:

Code Snippet
  1. public static void Add(double d1, double d2, out double dout)
  2. {
  3.     dout = d1 + d2;
  4. }

By choosing a route in our IL generator we can make a node 3 times faster compared to the dummy implementation that comes from template.

Also please note that it's much easier to reuse that function or post optimize.
If you have 50 nodes like this, finding a new speedup is a decent amount of refactoring, here, none is needed.

Now I'll just give a second example, since now devvvs have presented their new 5x to 6x faster Vector Join, split, I might put them to the test :)

First thing is indeed, compared to last version they are much faster, no brainer, which is great!

Please note that there's also Zip (Value), which is normally pretty optimized.

So Let's see, Vector (2d Join)

5000 Elements
Native : 0.035 ms
Zip : 0.035 ms

50000 Elements
Native : 0.37 ms
Zip : 0.32 ms

So now I decided to test 3 techniques, so are more code generation friendly, some are more raw plugin, here we go.

First a Vector Join function is simply like this:

Code Snippet
  1. [SlicewiseMethod(Name = "Vector", Category = "2d Join")]
  2. public static void V2dJoin(
  3.     [Input("X")] double x,
  4.     [Input("Y")] double y,
  5.     [Output("Output")] out Vector2D result)
  6. {
  7.     result = new Vector2D(x, y);
  8. }

Really simple no?

Now here are the testers.

Method one:
Code Snippet
  1.   [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr")]
  2.   public unsafe class DummyVector: IPluginEvaluate
  3.   {
  4.       [Input("Input 1")]
  5.       private ValueInput d1;
  6.  
  7.       [Input("Input 2")]
  8.       private ValueInput d2;
  9.  
  10.       [Output("Output")]
  11.       private ValueOutput dout;
  12.  
  13.       public void Evaluate(int SpreadMax)
  14.       {
  15.           dout.Length = SpreadMax*2;
  16.           double* d = dout.Data;
  17.  
  18.           double* pt1 = d1.Data;
  19.           double* pt2 = d2.Data;
  20.  
  21.           int l1 = d1.Length;
  22.           int l2 = d2.Length;
  23.  
  24.           for (int i = 0; i < SpreadMax; i++)
  25.           {
  26.               *d = pt1[i % l1];
  27.               d++;
  28.                 *d=pt2[i % l2];
  29.               d++;
  30.           }
  31.       }
  32.   }

Pretty simple, we push per component.

Method 2:

Code Snippet
  1. [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Struct")]
  2. public unsafe class DummyVector2 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ValueInput d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ValueInput d2;
  9.  
  10.     [Output("Output")]
  11.     private ValueOutput dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.Length = SpreadMax * 2;
  16.         Vector2D* d = (Vector2D*)dout.Data;
  17.  
  18.         double* pt1 = d1.Data;
  19.         double* pt2 = d2.Data;
  20.  
  21.         int l1 = d1.Length;
  22.         int l2 = d2.Length;
  23.  
  24.         for (int i = 0; i < SpreadMax; i++)
  25.         {
  26.             d->x = pt1[i % l1];
  27.             d->y = pt2[i % l2];
  28.             d++;
  29.         }
  30.     }
  31. }

This is almost the same, but we build as struct.

Now actually we can also simply use our slicewise:

Code Snippet
  1. [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Out")]
  2. public unsafe class DummyVector3 : IPluginEvaluate
  3. {
  4.     [Input("Input 1")]
  5.     private ValueInput d1;
  6.  
  7.     [Input("Input 2")]
  8.     private ValueInput d2;
  9.  
  10.     [Output("Output")]
  11.     private ValueOutput dout;
  12.  
  13.     public void Evaluate(int SpreadMax)
  14.     {
  15.         dout.Length = SpreadMax * 2;
  16.         Vector2D* d = (Vector2D*)dout.Data;
  17.         double* pt1 = d1.Data;
  18.         double* pt2 = d2.Data;
  19.         int l1 = d1.Length;
  20.         int l2 = d2.Length;
  21.         for (int i = 0; i < SpreadMax; i++)
  22.         {
  23.             Joiner.Join(pt1[i % l1], pt2[i % l2], out d[i]);
  24.         }
  25.     }
  26. }

So now the results:

5000 Elements
Native : 0.035 ms
Zip : 0.035 ms
Per Component: 0.035 ms
As Struct: 0.031 ms
Function Output:  0.038 ms

50000 Elements
Native : 0.37 ms
Zip : 0.32 ms
Per Component: 0.35 ms
As Struct: 0.33 ms
Function Output:  0.34 ms

Please note that values fluctuate a little bit.

But basically that means few things:

  • Devvvvs did a good job at optimizing vectors
  • On very low spread (50), zip seems slower, native is of course fastest (since you have COM object call). It is not as obvious tho.
  • A generated version from function is as Fast as optimized native ;)

So to resume, IL code generation is pretty cool.

Now you have a lot of other use cases than SliceWise.
What I decided was to build different templates, so for example:

Code Snippet
  1. [ReductorMethod(Name = "+", Category = "Value Spectral")]
  2. public static void SpectralAdd([Input("Input")] IEnumerable<double> data,[Output("Result")] out double result)
  3. {
  4.     result = 0.0;
  5.     foreach (double d in data) { result += d; }
  6. }

You can also do Struct Join/Split like this:

Code Snippet
  1. [Compactor(Name="Struct",Category="Test")]
  2. public struct MyStruct2
  3. {
  4.     [CompactorProperty(Name = "Hello")]
  5.     public double Hello;
  6.  
  7.     [CompactorProperty(Name = "Element2")]
  8.     public double Element2;
  9.  
  10.     [CompactorProperty(Name="Vector",Flatten=true)]
  11.     public Vector3 Vector;
  12. }

Or have a visitor class in case you want to use an existing struct.

There is some other types/parts, but that will be for another post.


1 comment:

  1. thanks man
    your snippets provided a simple way to access a spread without all the supersafe % to accomplish http://vvvv.org/contribution/map-%28advanced%29

    I have read through some of your other stuff and it got me thinking. Making this into a compute shader would require a different approach to optimizing, but it would not be that hard. Same node, really, but they'd end up in two different nodes, one for StructuredBuffer, one for Value.

    So this question arose: If nodes could be coded (or scripted) in an agnostic way, so they perform on gpu or cpu, depending only on their position in the graph, could your code generation be a way for both fx and il? And if only to give working templates for further optimization?

    ReplyDelete