For fun gatherings with friends

 

n00b tip: Gather round, children

 

I’ve been thinking quite a bit about “gather” loads recently.  By gather loads, I mean loading stuff from all over memory into one vector. Larrabee has support for this built in, making all kinds of cool stuff possible, but the SPUs cant even load a scalar.  Memory is loaded and stored one 16 byte aligned quadword at a time.  Take a look at the following bit of unoptimized C code.
























Thats the kind of thing I am talking about.  We have a table of offsets packed in a vector.  We want to use those offsets as an index into some area of local store, load that byte, and pack 16 of them into a vec_uchar16.  While the above code is pretty compact, the resulting assembly is anything but.  Because you have to generate shuffle masks, rotate, and do all kinds of work, the assembly looks like this:









































































Aarrgghh, indeed!  For anyone not used to working on the SPUs or any CPU that lacks scalar instructions, its a little shocking.  The compiler was smart enough to pull the loop-invariant shuffle masks outside the loop ( cbd instruction ) but the loop itself is a total mess.  We should be able to do better.


I thought about it for a few hours and came up with 4 alternatives.  Many of them involved calculating a rotate amount from the bottom 4 bits of the load address, but the one that ended up being the fastest was one that didn’t even use any rotates ( except the ones that extract the offsets from addresses 1, 2, 3, and 4 ).  I’ve been told that all the really good PS3 programmers take pictures of notes and use it as their presentations, so in an effort to try and make myself seem like I know what I am talking about, here is an outline of my idea.  Ignore my 3rd grade handwriting,































As you can see, we are going from a vector of address offsets to some actual pixels packed in a vector.  I am assuming we are starting at a point where we have all our data loaded into vectors, albeit in the wrong position.  First we take the offsets in the table and bitwise and by 0xF.  This gives us the bottom 4 address bits which correspond to the pixel we want’s offset inside the loaded vector. 


Then we add {0, 16, 0, 16} to it.  When making a shuffle mask, 0-15 selects bytes from the first vector and 16-31 selects bytes from the second vector.  By adding 16, I am keeping the offset the same, but I am making it pull from the second vector argument to SHUFB!  This becomes useful later on ;) 


We then take that result and use SHUFB to rearrange the bytes into another shuffle mask.  One that will not only extract the right bytes from the right places, but it will also put the result in just the right position to later be combined with a SELB.  The 8 masks below should demonstrate what I am trying to do.  Technically you only need 2 masks since the 0 fields could be anything


{ 3,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }

{ 0,0,11,15,0,0,0,0,0,0,0,0,0,0,0,0 }

{ 0,0,0,0,3,7,0,0,0,0,0,0,0,0,0,0 }

{ 0,0,0,0,0,0,11,15,0,0,0,0,0,0,0,0 }

{ 0,0,0,0,0,0,0,0,3,7,0,0,0,0,0,0 }

{ 0,0,0,0,0,0,0,0,0,0,11,15,0,0,0,0 }

{ 0,0,0,0,0,0,0,0,0,0,0,0,3,7,0,0 }

{ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,15 }


Next we apply those shuffle masks to the actual loaded data.  If V is a value we want, this gives the following sample result


V1, V2,  X,   X,   X, X, X, X, X, X, X, X, X, X, X, X

  X,  X,  V3, V4,  X, X, X, X, X, X, X, X, X, X, X, X


As you can see, the values in the vectors line up just right so that we can select using SELB.   The above example only combines 4 values but the full version shown below combines all 16 loaded pixels into one vector.


The results are shown below.  My assembly isn’t optimized yet but already it is a huge improvement over the C++ code.  I went from 81 cycles to 50 cycles, which is significant considering the number of times my loop runs.  As an added bonus, the even pipeline is now free enough to schedule the 99% even instructions that make up the lighting code!
















































I’m not saying that beating the compiler is an accomplishment, because it very rarely is.  I’m also not implying that I came up with something new that no one in the history of the universe has ever done.  I just think its a kinda cool trick and felt like sharing.  Hopefully somebody somewhere will find it useful


EDIT: old version posted above.  I cut another 4 cycles off the loop and I already have some ideas to get it down even further.


EDIT: a lot of you on twitter have pointed out that packing 16 pixels in a vec_uchar16 is usually a Very Bad Thing as the SPU byte instruction set is rather anemic.  This is true but the idea is custom designed to go with something like this.  Its almost all even instructions and can be used to calculate the length of rays 16 pixels at a time.  I’m not saying it will be useful in your everyday life, but rather its just an interesting trick that may save you someday :)  Besides, if you’re doing math, then just pack 8 shorts or 4 ints.  The technique still works, I’m only using char because I can

0022.Aug.24