Neighborly get-togethers

 

n00b tip: Wont you be my neighbor?

 

Disclaimer: I mean no disrespect towards Fred McFeely Rogers.  I grew up watching his show and he taught me all about the patience and tolerance that to this day I still don’t have.  I only chose the photo because this post is all about...  wait for it... neighbors!


Awhile back I started thinking about ways to average the 8 pixels neighboring some pixel.  I like this problem a lot because its quite basic/fundamental, but at the same time potentially has a number of interesting solutions.  If you’ve been doing SPU programming for any length of time, everything I say here will be completely obvious to you.  But for people new to the SPUs, hopefully it will keep them from writing some C++ SPU job that will probably work but isn’t anything you want iterating over millions of pixels. Or maybe it is.  Depends on your personality, I guess.


Lets assume we don’t want to be extracting scalars from vectors or doing things one at a time.  Step one is to load the pixels in some way that makes life easy.  Notice the following:


                               original vec: { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }

original vec rotated right 1 byte: { 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 }

  original vec rotated left 1 byte: { 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16 }


With the exception of the bytes on the end, we now have each pixel vertically aligned with its left and right neighbor, meaning we can process 16 pixels at a time.  To fix up the ends, you can use shufb to not only simulate the effect of vector rotation, but also to steal a byte from the previous and next vectors.  Your left and right shuffle masks will look like this


  left neighbor mask: { 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 }

right neighbor mask: { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }


I wont go into the finer details of SHUFB, but that first mask takes the last byte of vector 1 and the first 15 bytes of vector 2.  The second mask takes the last 15 bytes of vector 1 and the first byte of vector 2.  In other words, exactly what we want.  Now all we have to do is add.


The clever among you will notice two things wrong with that last statement.  First of all, 255 + 255 + 255 = 253 ( because the result is stored back to a byte ).  Unfortunately, in this case we can’t do any obvious overflow voodoo magic with CG ( carry generate instruction, not the NVidia shader language ) because CG only works on int and unsigned int vectors. 


The other problem is the SPU has no byte addition instruction.  If you don’t believe me, check out the code below.  If you use operator+ on two unsigned char vectors, you get the following lovely piece of code


lqr     $21,0x2df40

lqr     $20,0x2df00

andhi   $3,$21,-256    

ah      $19,$20,$21    

ah      $17,$3,$20

selb    $16,$17,$19,$18


Thats not to say that there are no instructions that add bytes together.  There are, but they don’t work like you think.  SUMB takes every 4 consecutive bytes from vector 1, adds them together, and puts the sums into the even elements of the destination short vec.  SUMB also takes every 4 consecutive bytes from vector 2, adds them together, and puts the sums into the odd elements of the destination short vec.  You do add 8 numbers simultaneously, but you still have to shift and you still have to do a lot of weird packing/unpacking.  I haven’t totally given up on the idea of using this instruction, but I have temporarily moved on to trying other things.  I still think there is good potential here!


edit: Using two shuffles you can get the data in a format where you can use SUMB to sum up all neighbors at once into shorts.  However, since the two operands to the shuffle need to be the vector and the vector below it, it doesn’t work so well for pixel windows that span horizontally across loaded vectors.  The problem is fixable but it would require an additional shuffle that depends on the result of the previous shuffle, or some rotates and selects.  For this reason, I’m not sure its the best way to go, but I’m not that clever so I could be missing something.


One way to “solve” the problem is to realize that because we will be averaging 8 items, the weight for each item is 1/8, or item >> 3.  However, consider the scenario where you have a pixel surrounded by all 7’s.  In non-float land, 7/8 = 0, therefore making the average of all the pixels in the neighborhood 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 0.  Maybe thats OK if you know our data will always be >= 8, but we still don’t have a fast way to sum up all those weighted pixels.  The obvious thing to do is quickly convert to short, add and shift, and put the result back in bytes but that seems excessive.  Oh well, if all else fails, we can always fall back to doing it that way.


There is another interesting instruction called AVGB, meaning average bytes.  From the name alone, you know this one is going to be damn useful.  It takes 2 vectors as operands, adds corresponding elements + 1, and then shifts right by 1.   For example, lets say the first elements of the 2 input vectors are 11 and 20.  AVGB will give us ( 11 + 20 + 1 ) >> 1 or 16.  Because there are an even number of pixels in the neighborhood


0  1  2

3  P  4

5  6  7


we can express the average as avg( avg( avg( 0, 1 ), avg( 2, 3 ) ), avg( avg( 4, 5 ), avg( 6, 7 ) ) ).  So thats it, with a few shuffle masks and some AVGB instructions, we can average neighbors for 16 pixels at once simultaneously, and most importantly we can do it in a way that minimizes loads and random accesses. 


I make no guarantees that this is the optimal way to do anything.  It all depends on your data and exactly what your needs are.  Maybe the precision is problematic for your use.  I am currently toying around with some alternative ideas that may or may not pan out.  If anything cool comes out of it, I’ll be sure to post.  Also, feel free to leave your thoughts and ideas in the comments section below.  I’d be curious what other techniques some of the more creative people in the industry are using. 


One last thing to note: don’t forget to reuse your loaded vectors!  To process a particular vector, you also need the vector that comes before it and the one that comes after it.  On your next iteration, the current vector becomes previous, and the next vector becomes current.  You can even hide the cost of loading the next next vector somewhere in the previous iteration, giving you completely free loads ( your mileage may vary, depends on what else you do in your loop ).

 

0022.Apr.24