C-C-C-Combo breaker!!1!
C-C-C-Combo breaker!!1!
n00b tip: Combining writes
Look, I am well aware that some of the things I post can be a little... errrr... specific. I only do it because I want to encourage others to do weird or non-obvious stuff as well. The SPUs may not have the insanely awesome ARM ISA, but thats no excuse to not see what you can do with alternate approaches.
As I hinted in my last post, I have an unhealthy obsession with scalar writes. The SPUs weren’t designed for scalar anything, and this has an interesting effect on how consecutive scalar writes are handled. Lets take a look at some really simple code
void BadWrites(char *ptr, int index1, int index2, int index3, char val1, char val2, char val3)
{
ptr[index1] = val1;
ptr[index2] = val2;
ptr[index3] = val3;
}
Looks simple enough. Now lets take a look at it in a screenshot of a static analyzer

Whoa! Actually the situation is worse that it appears. In the case where we are storing and then reading back in from the same address, there is an additional stall not shown above. I assume this is either because those stalls aren’t statically predictable (likely) or because the SPU hardware has advanced magic optimizations where the value doesn’t have to go through to memory (very unlikely). Why is the code gen so bad? Is the compiler crap? Yes. However, there is an actual valid reason for it. Lets say you computed two scalars that you want to write out. Because they are scalars, you have to load in the vectors that they live in, insert the value you computed in the right spot, and then write it back out. But what happens if those two scalars happen to live in the same vector? You get something like this
1) load in the current vector values
[8, 6, 7, 5] and [8, 6, 7, 5]
2)insert the scalars you calculated at the correct spot
[8, 3, 7, 5] and [8, 6, 1, 5]
3)write out the first vector, and then wrote out the second vector
See what I did there? The final value of the vector in LS will be [8, 6, 1, 5]. We totally overwrote and lost the first scalar we wanted to write. Because of this situation, the compiler must be conservative and load the existing first vector, write the first vector to memory, load the second vector, write the second vector, etc. It screws us in 2 ways. We can’t load the existing value of each vector ahead of time, and we can’t do consecutive writes without reading in between. Aarrgghh. However, there is a “different” way to go about it. Depending on how odd or even bound you are, it may work out advantageously for you.
The idea is dead simple. You go ahead and do whatever you want like you can guarantee scalars don’t occur in the same vector, Then at the last moment right before you write, you use precalculated masks to combine vectors if 2 values are in the same vector, or return the original vector otherwise. Its also possible to do it in a nice branch free way. You want it to look like this. Given vectors a = [8, 3, 7, 5] and b = [8, 6, 1, 5], if a and b are the same address, then a = [8, 3, 1, 5] and b = [8, 3, 1, 5], otherwise they dont change. The reason you are free to go ahead and write both freely is that if vector a and vector b both contain the same values, nothing is lost if one “overwrites” the other. If a and b are at different addresses, its safe to write them anyway. I will now show you the intrinsics version for the three value version. I also have a 2 value version but its only one cycle shorter due to scheduling holes.
// limitation: all addresses must be non-increasing or non-decreasing. 16, 18, 17 isnt allowed
// great news! most of this can be done offline or precomputed as a table of 16 vectors
// offset in the vector (0, 4, 8, or 12)
const vec_uint4 vec_offset_2 = spu_and(load_addr_2, 0x0000000F);
const vec_uint4 vec_offset_3 = spu_and(load_addr_3, 0x0000000F);
// vector num for seeing if 2 values occur in the same vector
const vec_uint4 vec_num_1 = spu_and(load_addr_1, 0xFFFFFFF0);
const vec_uint4 vec_num_2 = spu_and(load_addr_2, 0xFFFFFFF0);
const vec_uint4 vec_num_3 = spu_and(load_addr_3, 0xFFFFFFF0);
// if it is the same vector
const vec_uint4 is_same_vector_1 = spu_cmpeq(vec_num_1, vec_num_2);
const vec_uint4 is_same_vector_2 = spu_cmpeq(vec_num_2, vec_num_3);
const vec_uint4 shuffle_table_offset_1 = spu_sub(sixteen, vec_offset_2);
const vec_uint4 shuffle_table_offset_2 = spu_sub(sixteen, vec_offset_3);
// rot_mask_1 is {0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000}
// rot_mask_2 is {0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF}
const vec_uint4 shuffle_table_entry_1 = (vec_uint4)si_rotqby((qword)rot_mask_1, (qword)shuffle_table_offset_1);
const vec_uint4 shuffle_table_entry_2 = (vec_uint4)si_rotqby((qword)rot_mask_2, (qword)shuffle_table_offset_2);
const vec_uint4 vector_combinor_1 = spu_and(shuffle_table_entry_1, is_same_vector_1);
const vec_uint4 vector_combinor_2 = spu_and(shuffle_table_entry_2, is_same_vector_2);
// combine vectors and freely write without worrying what stomps what
const vec_uint4 vec_to_write_1 = spu_sel(vals_1, vals_2, vector_combinor_1);
const vec_uint4 vec_to_write_2 = spu_sel(vals_2, vec_to_write_1, is_same_vector_1);
const vec_uint4 vec_to_write_3 = spu_sel(vals_3, vec_to_write_2, vector_combinor_2);
Now lets take a look at how the two and three value versions schedule (in a vacuum). Ignore the final adds at 15 and 17. They are debug code.
Not bad at all. If you are doing stuff like this (2D image processing) then you are probably light on the even pipeline and very heavy on the odd. This may schedule right in for you. If not, like I said, the values can all be precomputed into a table of 16 vectors, pretty much collapsing all the above code into a load or two that can be done in parallel with whatever else you are doing. Pretty cool, eh? I use stuff like this a lot when rasterizing lines. I wanted to use it in the PixelJunk Shooter 2 lighting system but it never really made it in. Oh well, maybe next game!
0022.Nov.27