<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:iweb="http://www.apple.com/iweb" version="2.0">
  <channel>
    <title>6cycles</title>
    <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/6cycles.html</link>
    <description>Because your data is already being transformed while everyone else is missing L2 and hammering shared memory&lt;br/&gt;&lt;br/&gt;</description>
    <generator>iWeb 3.0.2</generator>
    <image>
      <url>http://6cycles.maisonikkoku.com/6Cycles/6cycles/6cycles_files/cell_core.jpg</url>
      <title>6cycles</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/6cycles.html</link>
    </image>
    <item>
      <title>rant: C# is for tools, and when I say tools I mean both that it is an adequate language for writing offline tools in, but also in that the people who use it for performance critical code are themselves tools in the douchebag sense of the word. </title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/11/27_rant__C_is_for_tools,_and_when_I_say_for_tools_I_mean_both_that_it_is_an_adequate_language_for_writing_tools_in,_and_that_the_people_who_use_it_for_performance_critical_code_are_themselves_tools_in_the_douche_sense_of_the_word..html</link>
      <guid isPermaLink="false">a5c69c58-0fa5-4c9f-b14c-86303c37d24d</guid>
      <pubDate>Sat, 27 Nov 2010 12:26:22 +0900</pubDate>
      <description>Update: this post is still in alpha.  I have taken lots of suggestions from people and some help from Miguel and made a bunch of changes, but it’s still not shippable.  If you have any ideas / corrections, please feel free to email me&lt;br/&gt;&lt;br/&gt;I try to avoid posting too many rants (this is my first one) because rants don’t help people write better SPU code, unless the person ranting is Mike Acton.  That being said, I think this subject deserves one.  &lt;br/&gt;&lt;br/&gt;So, why would I even touch a language that so obviously is not designed for people like me?  I think it’s important to know your enemies so I decided to learn a little bit and see why I pre-decided to dislike the language so much ;)  I would never own a Windows computer and I won’t use MS Office.  However, I love my Xbox 360 and Xbox Live with all my heart, so while I may be biased against Microsoft, it’s not impossible for me to love something they make if it doesn’t suck.  Also, there are people I know who think C# is a replacement for C and can be used for high performance scientific computing and games, and even Microsoft seems to be pushing this.  Unfortunately, it isn’t the case.  I’d like to show a few reasons why.&lt;br/&gt;&lt;br/&gt;Oh, and don’t go making arguments like I have a problem with the implementation and not the language.  A language is only as good as the three implementations that actually exist.&lt;br/&gt;&lt;br/&gt;Prologue: What C# Does Better Than C++, But Not As Well As C&lt;br/&gt;&lt;br/&gt;I will say that while the language design goals are clearly not aligned with my own list of important language features, the language does have some good points for people who like this kind of stuff.  The standard library seems to come with every possible function a person could ever possibly need to do certain sets of tasks.  Even I was amazed at how few helper functions I had to write.  
The initialization syntax provides some conveniences that aren’t present in C++.  The ease of use of the syntax for things like delegates and lambdas is utterly insane, and I don’t just mean compared to the nightmare that is boost.  The lack of unions, pointers, and some other insanely useful things should, in theory, help the compiler better sort out aliasing.  Not allowing multiple inheritance is a great idea for people who use all that object oriented crap.  The simple language design simplifies the compiler, and it seems a little more robust in checking compile-time errors. Finally, as someone who doesn’t like the parts of Objective-C that are not actual C, I think MonoTouch is a great way to go for people developing for iPhone, iPad, or whatever other touch devices it supports.  I am grateful to Miguel and the whole mono team for providing an alternative to Objective-C.&lt;br/&gt;&lt;br/&gt;Part 0: Setup&lt;br/&gt;&lt;br/&gt;running multiple times, turning off JIT in debugger, mono vs VS, what timers used, link to kalin’s asm gen page, checking for denormals&lt;br/&gt;&lt;br/&gt;Part 1: Vector Unfriendly&lt;br/&gt;&lt;br/&gt;For me, this is an absolute language killer.  The first thing I wanted to do was see how it performed in an area I am familiar with: SIMD.  After I searched for vector intrinsics for a while and completely failed, someone told me you have to install something called XNA to get the vector types.  So I did, and what I found completely shocked me.  It seems like the C# XNA vectors were actually scalars.  How the hell could this be?  I looked up how to enable SIMD types on the net but most of the info was about the XNA for C++.  I tried to write my own vector struct but couldn’t find the C# equivalent of __attribute__((aligned(16))).  Do you know why?  Because their VM can’t properly handle alignment.  Why?  I can only guess.  A coworker thought maybe they are doing some garbage collector optimization that doesn’t allow them to maintain the alignment of objects. 
 No wonder their vector class is really a scalar class.  So right off the bat I knew the language would be useless for me.  &lt;br/&gt;&lt;br/&gt;I checked out Mono’s SIMD stuff both at work on XP and at home under OSX and Ubuntu.  Amazingly, the SIMD stuff was orders of magnitude slower than scalar.  Finding that hard to believe, I traced through the SIMD source code and found a ton of comments like “TODO: keep things in registers” and “constructors are an absolute nightmare.”  Miguel from mono gave me some advice that fixed the situation for raw vectors but not structs.  By skipping monodevelop, compiling via command line, and adding --llvm, the code gen for mono improved greatly.  In fact, mono pretty much destroyed the Visual compiler.  Please see my mono section below for more details and analysis of the test.&lt;br/&gt;&lt;br/&gt;So basically for people wishing to use vector ops in Visual Studio, you are out of luck (and 64 bit versions of Windows using vector HW for scalar ops doesn’t count).  Unfortunately, my C# nightmare was just beginning.&lt;br/&gt;&lt;br/&gt;Part 2: Constructors and Inlining&lt;br/&gt;&lt;br/&gt;I eventually gave up, discarded my Vector1 class (what is the point without alignment), and decided to rewrite Vector4 in the ugliest way possible: using scalars.  The constructor had 2 versions, one taking 4 floats and one taking 1 float, and I also defined a multiplication operator.  For performance reasons I also compared it to doing operations on 4 loose floats that lived in the test function itself.  
What I found both shocked and horrified me.&lt;br/&gt;&lt;br/&gt;	public struct Vector4&lt;br/&gt;	{&lt;br/&gt;		float mX, mY, mZ, mW;&lt;br/&gt;&lt;br/&gt;		public Vector4( float x_in, float y_in, float z_in, float w_in )&lt;br/&gt;		{&lt;br/&gt;			mX = x_in; &lt;br/&gt;			mY = y_in; &lt;br/&gt;			mZ = z_in; &lt;br/&gt;			mW = w_in;&lt;br/&gt;		}&lt;br/&gt;&lt;br/&gt;		public Vector4( float x_in )&lt;br/&gt;		{&lt;br/&gt;			mX = x_in; &lt;br/&gt;			mY = x_in; &lt;br/&gt;			mZ = x_in; &lt;br/&gt;			mW = x_in;&lt;br/&gt;		}&lt;br/&gt;&lt;br/&gt;		public static Vector4 operator *( Vector4 vec1, Vector4 vec2 )&lt;br/&gt;		{&lt;br/&gt;			return new Vector4( vec1.mX * vec2.mX, vec1.mY * vec2.mY, vec1.mZ * vec2.mZ, vec1.mW * vec2.mW );&lt;br/&gt;		}&lt;br/&gt;	}&lt;br/&gt;&lt;br/&gt;(mono numbers.  VS numbers were slightly better but also terrible)&lt;br/&gt;Vector4 test took 33 ms for 1000000 multiplies ( 0.0000330 ms per call avg )&lt;br/&gt;Loose float test took 4 ms for 1000000 multiplies ( 0.0000040 ms per call avg )&lt;br/&gt;&lt;br/&gt;Now how curious is that?  My first instinct was to look at the assembly (the real meaning of assembly, not what C# people call assembly) and see what the hell is going on. However, mono only supports disassembly view in the hard debugger, and the hard debugger isn’t supported on OSX or Windows, so I went back to VS for the rest of the test.  It turns out that there were a few things going wrong, but the main culprits were lack of inlining and non-optimization of copies, especially with floats.  I will tackle inlining first.  When deciding to inline, the size of the IL seems to be used as an easy out, and functions larger than 32 bytes of IL are never inlined.  The size of the would-be generated code is important too.  
JIT 3.5 SP1 uses the following rules:&lt;br/&gt;&lt;br/&gt;	•	Estimate the size of the call site if the method were not inlined.&lt;br/&gt;	•	Estimate the size of the call site if it were inlined (estimate based on the IL, using a simple state machine (Markov Model), created using lots of real data to form this estimator logic)&lt;br/&gt;	•	Compute a multiplier. By default it is 1&lt;br/&gt;	•	Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop)&lt;br/&gt;	•	Increase the multiplier if it looks like struct optimizations will kick in.&lt;br/&gt;	•	If InlineSize &amp;lt;= NonInlineSize * Multiplier, do the inlining.&lt;br/&gt;&lt;br/&gt;I also found this list:&lt;br/&gt;&lt;br/&gt;• Methods that are greater than 32 bytes of IL will not be inlined.&lt;br/&gt;• Virtual functions are not inlined.&lt;br/&gt;• Methods that have complex flow control will not be inlined. Complex flow control is any flow control other than if/then/else; in this case, switch or while.&lt;br/&gt;• Methods that contain exception-handling blocks are not inlined, though methods that throw exceptions are still candidates for inlining.&lt;br/&gt;• If any of the method's formal arguments are structs, the method will not be inlined.&lt;br/&gt;&lt;br/&gt;Go ahead and let that very last one sink in.  Structs, even if they are small enough to fit in a register and passed as such, disqualify a function from inlining?  I can’t promise you it’s up to date or accurate, but it definitely explains some of the stuff I was seeing, especially since I was calling the mul operator in a loop which should have “bumped me to 5.”  If Spinal Tap was a compiler, it would have bumped me to 11.  
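&lt;br/&gt;&lt;br/&gt;Read literally, the first list boils down to a single comparison.  Here is a rough sketch of it in C, purely as an illustration: this is not the JIT source, the names should_inline, inline_size, non_inline_size and in_loop are made up, and the struct bump is left out because the list does not say how big it is.&lt;br/&gt;&lt;br/&gt;

```c
/* Hypothetical sketch of the JIT 3.5 SP1 inlining rule quoted above.
   Sizes are the two call site estimates; all names are invented. */
int should_inline(int inline_size, int non_inline_size, int in_loop)
{
    int multiplier = 1;      /* default multiplier */

    if (in_loop)
        multiplier = 5;      /* the quoted heuristic bumps it to 5 in a loop */

    /* inline when InlineSize is no bigger than NonInlineSize * Multiplier */
    return non_inline_size * multiplier >= inline_size;
}
```

&lt;br/&gt;&lt;br/&gt;Under this reading, an inlined estimate of 40 against a call site estimate of 10 only passes inside a loop (10 * 5 = 50).&lt;br/&gt;&lt;br/&gt;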
I don’t know what the “struct optimization” mentioned in the first list is, but I sure wasn’t seeing it.&lt;br/&gt;&lt;br/&gt;The Xenon rules are even more restrictive.  They cut you off at 16 bytes of IL, and disqualify you for having local variables, taking float args, and returning floats.  That’s all very counterintuitive to me, especially because lots of local const temporaries actually help GCC.  Let’s go ahead and test the float hypothesis.  To do this, I made three vector classes that look like the above struct but with the types swapped: one int version, one float version, and one double version.  I then called the multiply operator in a loop (spoiler: the int version inlined fine, absolutely nothing in the float version inlined, and doubles were faster than floats but only because they inlined)&lt;br/&gt;&lt;br/&gt;int version ( everything is inlined except the call to Random.Next )&lt;br/&gt;&lt;br/&gt;mov         ecx,ebx &lt;br/&gt;mov         eax,dword ptr [ecx] &lt;br/&gt;call        dword ptr [eax+3Ch] &lt;br/&gt;mov         edx,eax &lt;br/&gt;mov         eax,edx &lt;br/&gt;imul        eax,dword ptr [ebp-18h] &lt;br/&gt;mov         dword ptr [ebp-18h],eax &lt;br/&gt;mov         eax,edx &lt;br/&gt;imul        eax,esi &lt;br/&gt;mov         esi,eax &lt;br/&gt;mov         eax,edx &lt;br/&gt;imul        eax,dword ptr [ebp-1Ch] &lt;br/&gt;mov         dword ptr [ebp-1Ch],eax &lt;br/&gt;imul        edx,dword ptr [ebp-20h] &lt;br/&gt;mov         dword ptr [ebp-20h],edx &lt;br/&gt;inc         edi  &lt;br/&gt;cmp         edi,0F4240h &lt;br/&gt;jl          000000CF &lt;br/&gt;&lt;br/&gt;float version ( nothing at all is inlined.  No constructors, no operators, nothing )&lt;br/&gt;TODO: remember to mail myself the float version to include here&lt;br/&gt;&lt;br/&gt;double version ( perfectly inlined, but what’s with all those movqs? 
)&lt;br/&gt;&lt;br/&gt;lea         edi,[esp+000001E0h] &lt;br/&gt;lea         esi,[esp] &lt;br/&gt;movq        xmm0,mmword ptr [esi] &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+8] &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+10h] &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+18h] &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;mov         ecx,ebx &lt;br/&gt;mov         eax,dword ptr [ecx] &lt;br/&gt;call        dword ptr [eax+48h] &lt;br/&gt;fstp        qword ptr [esp+00000278h] &lt;br/&gt;fld         qword ptr [esp+00000278h] &lt;br/&gt;lea         edi,[esp+000001C0h] &lt;br/&gt;pxor        xmm0,xmm0 &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;fst         qword ptr [esp+000001C0h] &lt;br/&gt;fst         qword ptr [esp+000001C8h] &lt;br/&gt;fst         qword ptr [esp+000001D0h] &lt;br/&gt;fstp        qword ptr [esp+000001D8h] &lt;br/&gt;lea         edi,[esp+00000200h] &lt;br/&gt;lea         esi,[esp+000001E0h] &lt;br/&gt;movq        xmm0,mmword ptr [esi] &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+8] &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+10h] &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+18h] &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;lea         edi,[esp+00000220h] &lt;br/&gt;lea         esi,[esp+000001C0h] &lt;br/&gt;movq        xmm0,mmword ptr [esi] &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+8] &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+10h] &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq      
  xmm0,mmword ptr [esi+18h] &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;lea         edi,[esp+00000240h] &lt;br/&gt;pxor        xmm0,xmm0 &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;fld         qword ptr [esp+00000200h] &lt;br/&gt;fmul        qword ptr [esp+00000220h] &lt;br/&gt;fld         qword ptr [esp+00000208h] &lt;br/&gt;fmul        qword ptr [esp+00000228h] &lt;br/&gt;fld         qword ptr [esp+00000210h] &lt;br/&gt;fmul        qword ptr [esp+00000230h] &lt;br/&gt;fld         qword ptr [esp+00000218h] &lt;br/&gt;fmul        qword ptr [esp+00000238h] &lt;br/&gt;fxch        st(3) &lt;br/&gt;fstp        qword ptr [esp+00000240h] &lt;br/&gt;fxch        st(1) &lt;br/&gt;fstp        qword ptr [esp+00000248h] &lt;br/&gt;fstp        qword ptr [esp+00000250h] &lt;br/&gt;fstp        qword ptr [esp+00000258h] &lt;br/&gt;lea         edi,[esp] &lt;br/&gt;lea         esi,[esp+00000240h] &lt;br/&gt;movq        xmm0,mmword ptr [esi] &lt;br/&gt;movq        mmword ptr [edi],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+8] &lt;br/&gt;movq        mmword ptr [edi+8],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+10h] &lt;br/&gt;movq        mmword ptr [edi+10h],xmm0 &lt;br/&gt;movq        xmm0,mmword ptr [esi+18h] &lt;br/&gt;movq        mmword ptr [edi+18h],xmm0 &lt;br/&gt;inc         dword ptr [esp+30h] &lt;br/&gt;cmp         dword ptr [esp+30h],0F4240h &lt;br/&gt;jl          0000047F &lt;br/&gt;&lt;br/&gt;So that’s what I get when putting some scalars in a struct and trying to multiply them.  Obviously something else is going on.  I have to say, it all made me miss force_inline.  
Just because some programmers think that using force_inline on every function makes code moar faster doesn’t mean you should take away the option from programmers that know what they are doing and understand what’s semantically going on better than some JIT compiler.  I think the “why do you care, just trust the compiler” mentality of many C# programmers is pretty crappy...&lt;br/&gt;&lt;br/&gt;Part 3: Mono: Not So Smooth Operators&lt;br/&gt;&lt;br/&gt;I am not picking on mono.  Actually quite the opposite.  The mono results were better than what I got in Visual, so that’s why I am using them.&lt;br/&gt;&lt;br/&gt;I wanted to look further into why the vector operators in my struct took so much time in mono, so I tried a little experiment.  First of all, what should the compiler do when presented with a vector add in a loop?  Here are 2 examples from the beagleboard gcc compiler (using ARM NEON THUMB mode instructions) and ppu gcc compiler (PS3 Linux).&lt;br/&gt;&lt;br/&gt;; ARM NEON gcc (thumb)&lt;br/&gt;subs       r0,#1&lt;br/&gt;vadd.f32   q0,q0,q1&lt;br/&gt;bne        0x811DC9CE&lt;br/&gt;&lt;br/&gt;; PPU gcc&lt;br/&gt;000C83E0 mtspr   ctr,r0&lt;br/&gt;000C83F0 vaddfp  v1,v1,v0                      &lt;br/&gt;000C83F4 bdnz+   0x000C83F0 &lt;br/&gt;&lt;br/&gt;It’s exactly what you would expect.  I think the ARM one could have been done slightly better in non-THUMB mode but it’s not horrible.  The PPU gcc one is dead on.  
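&lt;br/&gt;&lt;br/&gt;For reference, the C source behind a loop like that can be as small as the sketch below.  This is an assumption, not the code that was actually compiled for the two listings: it uses GCC's generic vector_size extension rather than NEON or AltiVec intrinsics, and the names v4f and add_loop are made up for illustration.&lt;br/&gt;&lt;br/&gt;

```c
/* Hypothetical stand-in for the vector add loop; not the original test code.
   Relies on the GCC/clang vector extension. */
typedef float v4f __attribute__((vector_size(16)));   /* 4 floats in 16 bytes */

v4f add_loop(v4f acc, v4f step, int times)
{
    /* one 4-wide add per iteration; nothing here forces acc out to memory */
    while (times-- > 0)
        acc = acc + step;
    return acc;
}
```

&lt;br/&gt;&lt;br/&gt;With optimizations on, a loop of this shape gives the compiler every chance to keep the accumulator in a vector register, which is what both listings show: one add, one counter update, one branch.&lt;br/&gt;&lt;br/&gt;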
Now let’s look at what mono produces when using the command line and LLVM with raw vectors (Mono.Simd.Vector4f + Mono.Simd.Vector4f)&lt;br/&gt;&lt;br/&gt;DMCS_OPTIONS = -optimize+ -checked- -warn:4 -r:Mono.Simd &lt;br/&gt;MONO_OPTIONS = --llvm --optimize=all --gc=sgen&lt;br/&gt;MONO_SHOW_ASM_OPTIONS = --llvm --optimize=all --gc=sgen -v -v&lt;br/&gt;&lt;br/&gt;&amp;lt;BB&gt;:8&lt;br/&gt; 560:   0f 10 85 78 ff ff ff    movups -0x88(%ebp),%xmm0&lt;br/&gt; 567:   0f 10 4d 88             movups -0x78(%ebp),%xmm1&lt;br/&gt; 56b:   0f 58 c1                addps  %xmm1,%xmm0&lt;br/&gt; 56e:   46                      inc    %esi&lt;br/&gt; 56f:   0f 11 85 78 ff ff ff    movups %xmm0,-0x88(%ebp)&lt;br/&gt;&amp;lt;BB&gt;:7&lt;br/&gt; 576:   81 fe 40 42 0f 00       cmp    $0xf4240,%esi&lt;br/&gt; 57c:   7c e2                   jl     560 &amp;lt;dTests_MainClass_TestScalarPerf+0x560&gt;&lt;br/&gt;&lt;br/&gt;I am no x86 guy but I noticed a few things.  First, x86 instructions aren’t fixed length, and that makes me gag.  Second, I am not familiar with the architecture so much and maybe there is a reason, but I find it weird that those movups are in there.  Can it not just keep it in the registers it is in and not have to move stuff around?  Also the way those moves are using that base offset notation, they look suspiciously like loads of parameters off the stack.  WTF!  Aside from that, the code looks quite reasonable.  Let’s try to kill that reasonableness!  
I wrote the following vector struct&lt;br/&gt;&lt;br/&gt;public struct v4simd&lt;br/&gt;{&lt;br/&gt;    public Mono.Simd.Vector4f mVec;&lt;br/&gt;&lt;br/&gt;    public v4simd( float xIn, float yIn, float zIn, float wIn )&lt;br/&gt;    {&lt;br/&gt;        mVec = new Mono.Simd.Vector4f( xIn, yIn, zIn, wIn );    &lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;    public v4simd( float xIn )&lt;br/&gt;    {&lt;br/&gt;        mVec = new Mono.Simd.Vector4f( xIn );   &lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;    public v4simd( Mono.Simd.Vector4f vec )&lt;br/&gt;    {&lt;br/&gt;        mVec = vec; &lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;    public static v4simd operator +( v4simd vec1, v4simd vec2 )&lt;br/&gt;    {&lt;br/&gt;        return new v4simd( vec1.mVec + vec2.mVec ); &lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;    public static void StaticAddFunc(out v4simd v3, ref v4simd v1, ref v4simd v2)&lt;br/&gt;    {&lt;br/&gt;        v3.mVec = v1.mVec + v2.mVec;&lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;    public static void StaticAddFunc2(out v4simd v3, v4simd v1, v4simd v2)&lt;br/&gt;    {&lt;br/&gt;        v3.mVec = v1.mVec + v2.mVec;&lt;br/&gt;    }&lt;br/&gt;&lt;br/&gt;    public static v4simd StaticAddFunc3(v4simd v1, v4simd v2)&lt;br/&gt;    {&lt;br/&gt;        return new v4simd( v1.mVec + v2.mVec );&lt;br/&gt;    }&lt;br/&gt;}&lt;br/&gt;&lt;br/&gt;// results&lt;br/&gt;for( int t = 0; t &amp;lt; times; ++t )&lt;br/&gt;{&lt;br/&gt;    vec1 = vec1 + vec2;                                   // 20ms&lt;br/&gt;    //v4simd.StaticAddFunc(out vec1, ref vec1, ref vec2); // 6ms&lt;br/&gt;    //v4simd.StaticAddFunc2(out vec1, vec1, vec2);        // 13ms&lt;br/&gt;    //vec1 = v4simd.StaticAddFunc3(vec1, vec2);           // 20ms&lt;br/&gt;}&lt;br/&gt;&lt;br/&gt;It seems like the second I wrap the mono vector type in a struct, everything goes out the window.  The operator and StaticAddFunc3 took the most time at about 20 ms.  
StaticAddFunc2 took 14 ms, and StaticAddFunc took 6ms which is still slower than the scalar version.  Wow.  Counterintuitively, the more I passed by reference, the faster it was.  It seems related to an inability to keep things in registers and non-optimization of copies the way I would expect GCC to do.  This is a very unfortunate result because it stops you from chaining add and other things in longer expressions.  Maybe it’s better to just not wrap the raw vectors in a vector class at all.  Here is the assembly from mono:&lt;br/&gt;&lt;br/&gt;;static add func 1 (6ms):&lt;br/&gt;&lt;br/&gt; &amp;lt;BB&gt;:8&lt;br/&gt; 4a0:   8d 45 c8                lea    -0x38(%ebp),%eax&lt;br/&gt; 4a3:   8d 4d c8                lea    -0x38(%ebp),%ecx&lt;br/&gt; 4a6:   8d 55 d8                lea    -0x28(%ebp),%edx&lt;br/&gt; 4a9:   83 ec 04                sub    $0x4,%esp&lt;br/&gt; 4ac:   52                      push   %edx&lt;br/&gt; 4ad:   51                      push   %ecx&lt;br/&gt; 4ae:   50                      push   %eax&lt;br/&gt; 4af:   e8 8c 00 00 00          call   540 &amp;lt;dTests_MainClass_TestScalarPerf+0x540&gt;&lt;br/&gt; 4b4:   83 c4 10                add    $0x10,%esp&lt;br/&gt; 4b7:   46                      inc    %esi&lt;br/&gt;&amp;lt;BB&gt;:7&lt;br/&gt; 4b8:   81 fe 40 42 0f 00       cmp    $0xf4240,%esi&lt;br/&gt; 4be:   7c e0                   jl     4a0 &amp;lt;dTests_MainClass_TestScalarPerf+0x4a0&gt;&lt;br/&gt;&lt;br/&gt; ; static add func 2 (14ms):&lt;br/&gt;&lt;br/&gt;&amp;lt;BB&gt;:8&lt;br/&gt; 340:   8d 45 c8                lea    -0x38(%ebp),%eax&lt;br/&gt; 343:   8b 4d c8                mov    -0x38(%ebp),%ecx&lt;br/&gt; 346:   89 4d e8                mov    %ecx,-0x18(%ebp)&lt;br/&gt; 349:   8b 4d cc                mov    -0x34(%ebp),%ecx&lt;br/&gt; 34c:   89 4d ec                mov    %ecx,-0x14(%ebp)&lt;br/&gt; 34f:   8b 4d d0                mov    -0x30(%ebp),%ecx&lt;br/&gt; 352:   89 4d f0                mov    
%ecx,-0x10(%ebp)&lt;br/&gt; 355:   8b 4d d4                mov    -0x2c(%ebp),%ecx&lt;br/&gt; 358:   89 4d f4                mov    %ecx,-0xc(%ebp)&lt;br/&gt; 35b:   83 ec 0c                sub    $0xc,%esp&lt;br/&gt; 35e:   83 ec 10                sub    $0x10,%esp&lt;br/&gt; 361:   8b 4d d8                mov    -0x28(%ebp),%ecx&lt;br/&gt; 364:   89 0c 24                mov    %ecx,(%esp)&lt;br/&gt; 367:   8b 4d dc                mov    -0x24(%ebp),%ecx&lt;br/&gt; 36a:   89 4c 24 04             mov    %ecx,0x4(%esp)&lt;br/&gt; 36e:   8b 4d e0                mov    -0x20(%ebp),%ecx&lt;br/&gt; 371:   89 4c 24 08             mov    %ecx,0x8(%esp)&lt;br/&gt; 375:   8b 4d e4                mov    -0x1c(%ebp),%ecx&lt;br/&gt; 378:   89 4c 24 0c             mov    %ecx,0xc(%esp)&lt;br/&gt; 37c:   83 ec 10                sub    $0x10,%esp&lt;br/&gt; 37f:   8b 4d e8                mov    -0x18(%ebp),%ecx&lt;br/&gt; 382:   89 0c 24                mov    %ecx,(%esp)&lt;br/&gt; 385:   8b 4d ec                mov    -0x14(%ebp),%ecx&lt;br/&gt; 388:   89 4c 24 04             mov    %ecx,0x4(%esp)&lt;br/&gt; 38c:   8b 4d f0                mov    -0x10(%ebp),%ecx&lt;br/&gt; 38f:   89 4c 24 08             mov    %ecx,0x8(%esp)&lt;br/&gt; 393:   8b 4d f4                mov    -0xc(%ebp),%ecx&lt;br/&gt; 396:   89 4c 24 0c             mov    %ecx,0xc(%esp)&lt;br/&gt; 39a:   50                      push   %eax&lt;br/&gt; 39b:   e8 88 02 00 00          call   628 &amp;lt;dTests_MainClass_TestScalarPerf+0x628&gt;&lt;br/&gt; 3a0:   83 c4 30                add    $0x30,%esp&lt;br/&gt; 3a3:   46                      inc    %esi&lt;br/&gt;&amp;lt;BB&gt;:7&lt;br/&gt; 3a4:   81 fe 40 42 0f 00       cmp    $0xf4240,%esi&lt;br/&gt; 3aa:   7c 94                   jl     340 &amp;lt;dTests_MainClass_TestScalarPerf+0x340&gt;&lt;br/&gt;&lt;br/&gt;; static add func 3 (20ms):&lt;br/&gt;&lt;br/&gt;&amp;lt;BB&gt;:8&lt;br/&gt; 340:   8b 45 c8                mov    -0x38(%ebp),%eax&lt;br/&gt; 343:   89 
45 e8                mov    %eax,-0x18(%ebp)&lt;br/&gt; 346:   8b 45 cc                mov    -0x34(%ebp),%eax&lt;br/&gt; 349:   89 45 ec                mov    %eax,-0x14(%ebp)&lt;br/&gt; 34c:   8b 45 d0                mov    -0x30(%ebp),%eax&lt;br/&gt; 34f:   89 45 f0                mov    %eax,-0x10(%ebp)&lt;br/&gt; 352:   8b 45 d4                mov    -0x2c(%ebp),%eax&lt;br/&gt; 355:   89 45 f4                mov    %eax,-0xc(%ebp)&lt;br/&gt; 358:   8d 45 c8                lea    -0x38(%ebp),%eax&lt;br/&gt; 35b:   83 ec 0c                sub    $0xc,%esp&lt;br/&gt; 35e:   83 ec 10                sub    $0x10,%esp&lt;br/&gt; 361:   8b 4d d8                mov    -0x28(%ebp),%ecx&lt;br/&gt; 364:   89 0c 24                mov    %ecx,(%esp)&lt;br/&gt; 367:   8b 4d dc                mov    -0x24(%ebp),%ecx&lt;br/&gt; 36a:   89 4c 24 04             mov    %ecx,0x4(%esp)&lt;br/&gt; 36e:   8b 4d e0                mov    -0x20(%ebp),%ecx&lt;br/&gt; 371:   89 4c 24 08             mov    %ecx,0x8(%esp)&lt;br/&gt; 375:   8b 4d e4                mov    -0x1c(%ebp),%ecx&lt;br/&gt; 378:   89 4c 24 0c             mov    %ecx,0xc(%esp)&lt;br/&gt; 37c:   83 ec 10                sub    $0x10,%esp&lt;br/&gt; 37f:   8b 4d e8                mov    -0x18(%ebp),%ecx&lt;br/&gt; 382:   89 0c 24                mov    %ecx,(%esp)&lt;br/&gt; 385:   8b 4d ec                mov    -0x14(%ebp),%ecx&lt;br/&gt; 388:   89 4c 24 04             mov    %ecx,0x4(%esp)&lt;br/&gt; 38c:   8b 4d f0                mov    -0x10(%ebp),%ecx&lt;br/&gt; 38f:   89 4c 24 08             mov    %ecx,0x8(%esp)&lt;br/&gt; 393:   8b 4d f4                mov    -0xc(%ebp),%ecx&lt;br/&gt; 396:   89 4c 24 0c             mov    %ecx,0xc(%esp)&lt;br/&gt; 39a:   50                      push   %eax&lt;br/&gt; 39b:   e8 88 02 00 00          call   628 &amp;lt;dTests_MainClass_TestScalarPerf+0x628&gt;&lt;br/&gt; 3a0:   83 c4 2c                add    $0x2c,%esp&lt;br/&gt; 3a3:   46                      inc    
%esi&lt;br/&gt;&amp;lt;BB&gt;:7&lt;br/&gt; 3a4:   81 fe 40 42 0f 00       cmp    $0xf4240,%esi&lt;br/&gt; 3aa:   7c 94                   jl     340 &amp;lt;dTests_MainClass_TestScalarPerf+0x340&gt;&lt;br/&gt;&lt;br/&gt;Interestingly, the assembly for the last two is pretty much identical despite a huge timing difference, implying that the difference is in the function being called and not the call site.  Unfortunately, due to what is probably my own fault, I wasn’t able to get mono to spit out the code jumped to by&lt;br/&gt;&lt;br/&gt;call   628 &amp;lt;dTests_MainClass_TestScalarPerf+0x628&gt;&lt;br/&gt;&lt;br/&gt;The function ends way before that offset, and in fact there weren’t even any vector instructions in the assembly dump.  Honestly, I have no idea where my code went but I expect to study it further in my free time.&lt;br/&gt;&lt;br/&gt;Part 4: Conclusion&lt;br/&gt;&lt;br/&gt;It’s pretty common for GCC n00bs to look at some assembly, see some spilling, and triumphantly proclaim “This compiler sucks, and therefore was written by retards” like they accomplished something.  They don’t know what form the compiler prefers stuff be written in, and they don’t know what little seemingly harmless things disable entire groups of optimizations.  They see bad code gen, assume the compiler sucks, and don’t think any further about what’s really going on.   I am definitely in that stage where I don’t know what to do to help the compiler, and despite the intentionally inflammatory way I phrased everything above, I am trying to not just shut my brain off and take the lazy way out.  Of course it doesn’t help that I don’t know x86 assembly or architecture very well at all, as I have only been programming for real platforms over the past 15 years.  All of the above results can probably be explained by someone who knows more about the C# compiler than I do, and I fully encourage that person to let me know what I did wrong.  
I have a real job and therefore didn’t have a huge amount of spare time to play around and try alternate things, however I will say this.  I may not be a porgramming jeenyus, but if someone like me couldn’t coax good performance out of the compiler, I worry about the legions of other people out there that may not even know enough to wonder if there is a performance difference between Enumerable.Repeat(...).ToArray() and initialization via for loop.&lt;br/&gt;&lt;br/&gt;Epilogue: Why Many C# Programmers Are Annoying Twits&lt;br/&gt;&lt;br/&gt;Unrelated sub-rant...  I don’t want to generalize all C# programmers based on the few that answer questions on Stack Overflow, but I really don’t like C# programmers.  At all.  Someone on a forum somewhere posted a question about writing an optimized sqrt or sin function, and immediately everyone jumped on top of him with these asshole answers like “why would you ever want to do something like that” and “why do you even care.”  OK, I get it.  Blah blah blah optimize your algorithms and higher level stuff first.  But if someone is interested in why one way of doing things is faster than another, or if they want to know the fastest way to do something, you should encourage them, not try to make them feel like an ass for even wanting to know.  And it’s not just Stack Overflow, but all over message boards.  Some jerk ripped into a n00b for wanting to do things with pointers.  He was on some “Like O-M-G why would anyone ever want to do anything with pointers?  
That’s like so totally 1970’s.”  Well, before you go running your mouth off just because someone else cares about something that you don’t find important, you may want to check and see what Microsoft has to say on the matter.&lt;br/&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/magazine/cc163995.aspx&quot;&gt;http://msdn.microsoft.com/en-us/magazine/cc163995.aspx&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;So this is not a point against C#, as pointers are clearly part of the language.  It’s just another thing I don’t like about so-called messageboard gurus who think that just because they are content not knowing the address and alignment of something, they can put a nail in pointers’ coffin.  These people have an incredibly annoying “why do you care, just trust the compiler” attitude that makes me worry about the future of programmers.  If you are not one of those unreasonable people, then I have no issue with you.  
I guess I have the same problem with the C++ retards who think you’re nuts for thinking there is something wrong with using nested patterns of patterns of patterns.&lt;br/&gt;&lt;br/&gt;References&lt;br/&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/library/ee418732(VS.85).aspx#Duality&quot;&gt;http://msdn.microsoft.com/en-us/library/ee418732(VS.85).aspx#Duality&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/magazine/cc163995.aspx&quot;&gt;http://msdn.microsoft.com/en-us/magazine/cc163995.aspx&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/library/ms973852.aspx&quot;&gt;http://msdn.microsoft.com/en-us/library/ms973852.aspx&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/library/ms973858.aspx&quot;&gt;http://msdn.microsoft.com/en-us/library/ms973858.aspx&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://msdn.microsoft.com/en-us/library/ms973858.aspx#highperfmanagedapps_topic10&quot;&gt;http://msdn.microsoft.com/en-us/library/ms973858.aspx#highperfmanagedapps_topic10&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://www.cuttingedge.it/blogs/steven/pivot/entry.php?id=40#body&quot;&gt;http://www.cuttingedge.it/blogs/steven/pivot/entry.php?id=40#body&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://www.ademiller.com/blogs/tech/2008/08/c-inline-methods-and-optimization/&quot;&gt;http://www.ademiller.com/blogs/tech/2008/08/c-inline-methods-and-optimization/&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://blogs.msdn.com/b/vancem/archive/2006/02/20/535807.aspx&quot;&gt;http://blogs.msdn.com/b/vancem/archive/2006/02/20/535807.aspx&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://blogs.msdn.com/b/davidnotario/archive/2004/11/01/250398.aspx&quot;&gt;http://blogs.msdn.com/b/davidnotario/archive/2004/11/01/250398.aspx&lt;/a&gt;&lt;br/&gt;&lt;a href=&quot;http://blogs.msdn.com/b/ericgu/archive/2004/01/29/64717.aspx&quot;&gt;http://blogs.msdn.com/b/ericgu/archive/2004/01/29/64717.aspx&lt;/a&gt;&lt;br/&gt;&lt;a 
href=&quot;http://blogs.msdn.com/b/vancem/archive/2008/05/12/what-s-coming-in-net-runtime-performance-in-version-v3-5-sp1.aspx&quot;&gt;http://blogs.msdn.com/b/vancem/archive/2008/05/12/what-s-coming-in-net-runtime-performance-in-version-v3-5-sp1.aspx&lt;/a&gt;</description>
    </item>
    <item>
      <title>n00b tip: Combining writes</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/11/27_n00b_tip__Combining_writes.html</link>
      <guid isPermaLink="false">e3de645c-5420-41ed-836f-84626ed85e20</guid>
      <pubDate>Sat, 27 Nov 2010 00:49:16 +0900</pubDate>
      <description>Look, I am well aware that some of the things I post can be a little... errrr... specific.  I only do it because I want to encourage others to do weird or non-obvious stuff as well.  The SPUs may not have the insanely awesome ARM ISA, but that’s no excuse to not see what you can do with alternate approaches.&lt;br/&gt;&lt;br/&gt;As I hinted in my last post, I have an unhealthy obsession with scalar writes.  The SPUs weren’t designed for scalar anything, and this has an interesting effect on how consecutive scalar writes are handled.  Let’s take a look at some really simple code:&lt;br/&gt;&lt;br/&gt;void BadWrites(char *ptr, int index1, int index2, int index3, char val1, char val2, char val3)&lt;br/&gt;{&lt;br/&gt;    ptr[index1] = val1;&lt;br/&gt;    ptr[index2] = val2;&lt;br/&gt;    ptr[index3] = val3;&lt;br/&gt;}&lt;br/&gt;&lt;br/&gt;Looks simple enough.  Now let’s take a look at it in a screenshot of a static analyzer.&lt;br/&gt;&lt;br/&gt;[screenshot: static analyzer output for BadWrites]&lt;br/&gt;&lt;br/&gt;Whoa! Actually the situation is worse than it appears.  In the case where we are storing and then reading back in from the same address, there is an additional stall not shown above.  I assume this is either because those stalls aren’t statically predictable (likely) or because the SPU hardware has advanced magic optimizations where the value doesn’t have to go through to memory (very unlikely).  Why is the code gen so bad?  Is the compiler crap? Yes.  However, there is an actual valid reason for it.  Let’s say you computed two scalars that you want to write out.  Because they are scalars, you have to load in the vectors that they live in, insert the value you computed in the right spot, and then write it back out.  But what happens if those two scalars happen to live in the same vector?  
You get something like this:&lt;br/&gt;&lt;br/&gt;1) load in the current vector values&lt;br/&gt;[8, 6, 7, 5] and [8, 6, 7, 5]&lt;br/&gt;&lt;br/&gt;2) insert the scalars you calculated at the correct spot&lt;br/&gt;[8, 3, 7, 5] and [8, 6, 1, 5]&lt;br/&gt;&lt;br/&gt;3) write out the first vector, and then write out the second vector&lt;br/&gt;&lt;br/&gt;See what I did there?  The final value of the vector in LS will be [8, 6, 1, 5].  We totally overwrote and lost the first scalar we wanted to write.  Because of this situation, the compiler must be conservative and load the existing first vector, write the first vector to memory, load the second vector, write the second vector, etc.  It screws us in 2 ways.  We can’t load the existing value of each vector ahead of time, and we can’t do consecutive writes without reading in between. Aarrgghh.  However, there is a “different” way to go about it.  Depending on how odd- or even-pipeline bound you are, it may work out advantageously for you. &lt;br/&gt;&lt;br/&gt;The idea is dead simple.  You go ahead and do whatever you want as if you could guarantee the scalars don’t occur in the same vector.  Then at the last moment right before you write, you use precalculated masks to combine vectors if 2 values are in the same vector, or return the original vector otherwise.  It’s also possible to do it in a nice branch-free way.  You want it to look like this.  Given vectors a = [8, 3, 7, 5] and b = [8, 6, 1, 5], if a and b are at the same address, then a = [8, 3, 1, 5] and b = [8, 3, 1, 5], otherwise they don’t change.  The reason you are free to go ahead and write both is that if vector a and vector b both contain the same values, nothing is lost if one “overwrites” the other.  If a and b are at different addresses, it’s safe to write them anyway.  I will now show you the intrinsics version for the three value version.  I also have a 2 value version but it’s only one cycle shorter due to scheduling holes.  
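&lt;br/&gt;&lt;br/&gt;Before the intrinsics, the hazard and the fix described above can be sketched as a toy model in plain C.  This is only a sketch under assumptions: local store is modeled as an array of words with 4 words per quadword, and Quad, load_quad, store_quad, hoisted_writes and combined_writes are hypothetical names, not SPU intrinsics.  It also uses an ordinary if where the real code uses masks and spu_sel to stay branch free.

```c
/* Toy model of local store: every load and store moves a whole quadword
   (4 words here).  All names in this sketch are hypothetical. */
enum { WORDS_PER_QUAD = 4 };

typedef struct { unsigned w[WORDS_PER_QUAD]; } Quad;

/* Load the quadword that contains word index idx. */
static Quad load_quad(const unsigned *ls, unsigned idx)
{
    Quad q;
    unsigned i, base = idx - idx % WORDS_PER_QUAD;
    for (i = 0; i != WORDS_PER_QUAD; ++i)
        q.w[i] = ls[base + i];
    return q;
}

/* Store q over the quadword that contains word index idx. */
static void store_quad(unsigned *ls, unsigned idx, Quad q)
{
    unsigned i, base = idx - idx % WORDS_PER_QUAD;
    for (i = 0; i != WORDS_PER_QUAD; ++i)
        ls[base + i] = q.w[i];
}

/* Both loads hoisted above both stores: when idx1 and idx2 share a
   quadword, the second store wipes out val1. */
static void hoisted_writes(unsigned *ls, unsigned idx1, unsigned val1,
                           unsigned idx2, unsigned val2)
{
    Quad a = load_quad(ls, idx1);
    Quad b = load_quad(ls, idx2);
    a.w[idx1 % WORDS_PER_QUAD] = val1;
    b.w[idx2 % WORDS_PER_QUAD] = val2;
    store_quad(ls, idx1, a);
    store_quad(ls, idx2, b);
}

/* Same hoisted loads, but vectors that share a quadword are combined
   before writing, so it no longer matters which store lands last. */
static void combined_writes(unsigned *ls, unsigned idx1, unsigned val1,
                            unsigned idx2, unsigned val2)
{
    Quad a = load_quad(ls, idx1);
    Quad b = load_quad(ls, idx2);
    a.w[idx1 % WORDS_PER_QUAD] = val1;
    b.w[idx2 % WORDS_PER_QUAD] = val2;
    if (idx1 / WORDS_PER_QUAD == idx2 / WORDS_PER_QUAD) {
        a.w[idx2 % WORDS_PER_QUAD] = val2;  /* a now holds both scalars */
        b = a;                              /* and so does b */
    }
    store_quad(ls, idx1, a);
    store_quad(ls, idx2, b);
}
```

With local store holding [8, 6, 7, 5], hoisted_writes(ls, 1, 3, 2, 1) leaves [8, 6, 1, 5] (the first scalar is lost), while combined_writes with the same arguments leaves [8, 3, 1, 5].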
&lt;br/&gt;&lt;br/&gt;    // limitation: all addresses must be non-increasing or non-decreasing.  16, 18, 17 isnt allowed&lt;br/&gt;    // great news! most of this can be done offline or precomputed as a table of 16 vectors&lt;br/&gt;&lt;br/&gt;    // offset in the vector (0, 4, 8, or 12)&lt;br/&gt;    const vec_uint4 vec_offset_2 = spu_and(load_addr_2, 0x0000000F);&lt;br/&gt;    const vec_uint4 vec_offset_3 = spu_and(load_addr_3, 0x0000000F);&lt;br/&gt;    // vector num for seeing if 2 values occur in the same vector &lt;br/&gt;    const vec_uint4 vec_num_1 = spu_and(load_addr_1, 0xFFFFFFF0);&lt;br/&gt;    const vec_uint4 vec_num_2 = spu_and(load_addr_2, 0xFFFFFFF0);&lt;br/&gt;    const vec_uint4 vec_num_3 = spu_and(load_addr_3, 0xFFFFFFF0);&lt;br/&gt;    // if it is the same vector&lt;br/&gt;    const vec_uint4 is_same_vector_1 = spu_cmpeq(vec_num_1, vec_num_2);&lt;br/&gt;    const vec_uint4 is_same_vector_2 = spu_cmpeq(vec_num_2, vec_num_3);&lt;br/&gt;    const vec_uint4 shuffle_table_offset_1 = spu_sub(sixteen, vec_offset_2);&lt;br/&gt;    const vec_uint4 shuffle_table_offset_2 = spu_sub(sixteen, vec_offset_3);&lt;br/&gt;    // rot_mask_1 is {0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000} &lt;br/&gt;    // rot_mask_2 is {0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF}&lt;br/&gt;    const vec_uint4 shuffle_table_entry_1 = (vec_uint4)si_rotqby((qword)rot_mask_1, (qword)shuffle_table_offset_1);&lt;br/&gt;    const vec_uint4 shuffle_table_entry_2 = (vec_uint4)si_rotqby((qword)rot_mask_2, (qword)shuffle_table_offset_2);&lt;br/&gt;    const vec_uint4 vector_combinor_1 = spu_and(shuffle_table_entry_1, is_same_vector_1);&lt;br/&gt;    const vec_uint4 vector_combinor_2 = spu_and(shuffle_table_entry_2, is_same_vector_2);&lt;br/&gt;    // combine vectors and freely write without worrying what stomps what&lt;br/&gt;    const vec_uint4 vec_to_write_1 = spu_sel(vals_1, vals_2, vector_combinor_1);&lt;br/&gt;    const vec_uint4 vec_to_write_2 = spu_sel(vals_2, vec_to_write_1, 
is_same_vector_1);&lt;br/&gt;    const vec_uint4 vec_to_write_3 = spu_sel(vals_3, vec_to_write_2, vector_combinor_2);&lt;br/&gt;&lt;br/&gt;Now let’s take a look at how the two and three value versions schedule (in a vacuum).  Ignore the final adds at 15 and 17.  They are debug code.&lt;br/&gt;&lt;br/&gt;[screenshots: schedules for the two and three value versions]&lt;br/&gt;&lt;br/&gt;Not bad at all.  If you are doing stuff like this (2D image processing) then you are probably light on the even pipeline and very heavy on the odd.  This may schedule right in for you.  If not, like I said, the values can all be precomputed into a table of 16 vectors, pretty much collapsing all the above code into a load or two that can be done in parallel with whatever else you are doing.  Pretty cool, eh?  I use stuff like this a lot when rasterizing lines.  I wanted to use it in the PixelJunk Shooter 2 lighting system but it never really made it in.  Oh well, maybe next game!</description>
    </item>
    <item>
      <title>n00b tip: Gather round, children</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/8/24_n00b_tip__Gathering_thought_experiment.html</link>
      <guid isPermaLink="false">c7974035-0b8b-4fcf-9827-1988208aa7be</guid>
      <pubDate>Tue, 24 Aug 2010 22:49:21 +0900</pubDate>
      <description>I’ve been thinking quite a bit about “gather” loads recently.  By gather loads, I mean loading stuff from all over memory into one vector. Larrabee has support for this built in, making all kinds of cool stuff possible, but the SPUs can’t even load a scalar.  Memory is loaded and stored one 16 byte aligned quadword at a time.  Take a look at the following bit of unoptimized C code.&lt;br/&gt;&lt;br/&gt;[image: the unoptimized C gather loop]&lt;br/&gt;&lt;br/&gt;That’s the kind of thing I am talking about.  We have a table of offsets packed in a vector.  We want to use those offsets as an index into some area of local store, load that byte, and pack 16 of them into a vec_uchar16.  While the above code is pretty compact, the resulting assembly is anything but.  Because you have to generate shuffle masks, rotate, and do all kinds of work, the assembly looks like this:&lt;br/&gt;&lt;br/&gt;[image: compiler-generated assembly for the gather loop]&lt;br/&gt;&lt;br/&gt;Aarrgghh, indeed!  
For anyone not used to working on the SPUs or any CPU that lacks scalar instructions, it’s a little shocking.  The compiler was smart enough to pull the loop-invariant shuffle masks outside the loop ( cbd instruction ) but the loop itself is a total mess.  We should be able to do better.&lt;br/&gt;&lt;br/&gt;I thought about it for a few hours and came up with 4 alternatives.  Many of them involved calculating a rotate amount from the bottom 4 bits of the load address, but the one that ended up being the fastest was one that didn’t even use any rotates ( except the ones that extract the offsets from addresses 1, 2, 3, and 4 ).  I’ve been told that all the really good PS3 programmers take pictures of notes and use them as their presentations, so in an effort to try and make myself seem like I know what I am talking about, here is an outline of my idea.  Ignore my 3rd grade handwriting.&lt;br/&gt;&lt;br/&gt;[image: handwritten outline of the gather idea]&lt;br/&gt;&lt;br/&gt;As you can see, we are going from a vector of address offsets to some actual pixels packed in a vector.  I am assuming we are starting at a point where we have all our data loaded into vectors, albeit in the wrong position.  First we take the offsets in the table and bitwise AND them with 0xF.  This gives us the bottom 4 address bits, which correspond to the offset of the pixel we want inside the loaded vector.  &lt;br/&gt;&lt;br/&gt;Then we add {0, 16, 0, 16} to it.  When making a shuffle mask, 0-15 selects bytes from the first vector and 16-31 selects bytes from the second vector.  By adding 16, I am keeping the offset the same, but I am making it pull from the second vector argument to SHUFB!  
This becomes useful later on ;)  &lt;br/&gt;&lt;br/&gt;We then take that result and use SHUFB to rearrange the bytes into another shuffle mask, one that will not only extract the right bytes from the right places, but will also put the result in just the right position to later be combined with a SELB.  The 8 masks below should demonstrate what I am trying to do.  Technically you only need 2 masks since the 0 fields could be anything.&lt;br/&gt;&lt;br/&gt;	{ 3,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }&lt;br/&gt;	{ 0,0,11,15,0,0,0,0,0,0,0,0,0,0,0,0 }&lt;br/&gt;	{ 0,0,0,0,3,7,0,0,0,0,0,0,0,0,0,0 }&lt;br/&gt;	{ 0,0,0,0,0,0,11,15,0,0,0,0,0,0,0,0 }&lt;br/&gt;	{ 0,0,0,0,0,0,0,0,3,7,0,0,0,0,0,0 }&lt;br/&gt;	{ 0,0,0,0,0,0,0,0,0,0,11,15,0,0,0,0 }&lt;br/&gt;	{ 0,0,0,0,0,0,0,0,0,0,0,0,3,7,0,0 }&lt;br/&gt;	{ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,15 }&lt;br/&gt;&lt;br/&gt;Next we apply those shuffle masks to the actual loaded data.  If V is a value we want, this gives the following sample result:&lt;br/&gt;&lt;br/&gt;V1, V2,  X,   X,   X, X, X, X, X, X, X, X, X, X, X, X&lt;br/&gt;  X,  X,  V3, V4,  X, X, X, X, X, X, X, X, X, X, X, X&lt;br/&gt;&lt;br/&gt;As you can see, the values in the vectors line up just right so that we can select using SELB.   The above example only combines 4 values but the full version shown below combines all 16 loaded pixels into one vector. &lt;br/&gt;&lt;br/&gt;The results are shown below.  My assembly isn’t optimized yet but already it is a huge improvement over the C++ code.  I went from 81 cycles to 50 cycles, which is significant considering the number of times my loop runs.  
As an added bonus, the even pipeline is now free enough to schedule the 99% even instructions that make up the lighting code!&lt;br/&gt;&lt;br/&gt;[image: my assembly version of the gather loop]&lt;br/&gt;&lt;br/&gt;I’m not saying that beating the compiler is an accomplishment, because it very rarely is.  I’m also not implying that I came up with something new that no one in the history of the universe has ever done.  I just think it’s a kinda cool trick and felt like sharing.  Hopefully somebody somewhere will find it useful.&lt;br/&gt;&lt;br/&gt;EDIT: old version posted above.  I cut another 4 cycles off the loop and I already have some ideas to get it down even further.&lt;br/&gt;&lt;br/&gt;EDIT: a lot of you on twitter have pointed out that packing 16 pixels in a vec_uchar16 is usually a Very Bad Thing as the SPU byte instruction set is rather anemic.  This is true but the idea is custom designed to go with something like &lt;a href=&quot;Entries/2010/4/17_n00b_tip__Psychic_computing.html&quot;&gt;this&lt;/a&gt;.  It’s almost all even instructions and can be used to calculate the length of rays 16 pixels at a time.  I’m not saying it will be useful in your everyday life, but rather it’s just an interesting trick that may save you someday :)  Besides, if you’re doing math, then just pack 8 shorts or 4 ints.  The technique still works, I’m only using char because I can.</description>
    </item>
    <item>
      <title>#OmarVsJaymin: because we all know someone who sucks...</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/24_OmarVsJaymin__best_thing_on_Twitter.html</link>
      <guid isPermaLink="false">1a6cde7b-ceae-4bd2-b524-ce43045c2a49</guid>
      <pubDate>Mon, 24 May 2010 23:26:59 +0900</pubDate>
      <description>&lt;a href=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/24_OmarVsJaymin__best_thing_on_Twitter_files/i-can-has-programming-language.jpg&quot;&gt;&lt;img src=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Media/object002_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:176px; height:156px;&quot;/&gt;&lt;/a&gt;So, here is how this all started.  I have a history of saying things on Twitter that reflect my “screw practicality, programming is fun” philosophy.  This annoys my coworker Omar to no end.  Omar loves to troll/harass me, and I enjoy returning the favor.  One day after observing a particularly uninteresting fight over languages, our boss Dylan realized that by creating a Twitter list, Omar and I wouldn’t have to specifically mention each other therefore freeing up precious characters with which to insult each other.&lt;br/&gt;&lt;br/&gt;Unlike facebook, I respect my users’ privacy.  Therefore I am going to start by posting mine, and add posts from other people as they give me permission to use their names and/or quotes.  I really hope to be able to post others soon because its some of the most insanely funny stuff I have ever heard&lt;br/&gt;&lt;br/&gt;By the way, the list is ongoing and anyone can and should join in!  #OmarVsJaymin on Twitter!&lt;br/&gt;&lt;br/&gt;@okonomiyonda&lt;br/&gt;#OmarVsJaymin you miss L1 so much that you cry yourself to sleep every night with a picture of it under your pillow&lt;br/&gt;#OmarVsJaymin IBM changed the PS3s cache associativity section in the HW manual to a disclaimer. It reads &amp;quot;no-way associated with your code&amp;quot;&lt;br/&gt;#OmarVsJaymin you spill so much that even BP is glad they're not you&lt;br/&gt;#OmarVsJaymin I heard your render func is in a contest with Gran Turismo 5 to see which is finished first&lt;br/&gt;#OmarVsJaymin your shader only does one lookup... 
and its the phone number for the suicide prevention hotline&lt;br/&gt;#OmarVsJaymin Miss Teen S. Carolina saw your stl container overusage and thought it could help countries like The Iraq that dont have maps&lt;br/&gt;#OmarVsJaymin Someone told you it's best to work with the data in cache while its hot... so you disconnected your motherboard fan&lt;br/&gt;#OmarVsJaymin your code is so overengineered, that even Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides cant unerstand it&lt;br/&gt;#OmarVsJaymin Your frame is so long that Tuner gave up looking for your sync function halfway through your GameMonkey upate&lt;br/&gt;#OmarVsJaymin Your DMA transfers have so many stalls, I thought your MFC was having a matsuri&lt;br/&gt;#OmarVsJaymin @mike_action realized Insomniac can greatly improve graphics AND go back to 60fps just by firing you&lt;br/&gt;#OmarVsJaymin Oh, hai! I rewrote your code to use linked lists. It reduced cache misses by 75%&lt;br/&gt;#OmarVsJaymin the linker took your object file and performed relocation... to the trash&lt;br/&gt;#OmarVsJaymin The Mayans looked into the future, saw 2012 lines of your code, and were convinced humanity is screwed&lt;br/&gt;#OmarVsJaymin Speaking of SPURS, you have so few jobs scheduled that the US government had to adjust their unemployment numbers&lt;br/&gt;#OmarVsJaymin when I tried compiling your code, gcc output suicide_note.txt and deleted itself off my hard drive.&lt;br/&gt;#OmarVsJaymin I heard your cache just replaced Linfen, China as the most polluted place on Earth!&lt;br/&gt;#OmarVsJaymin your render is so slow that SIGGRAPH people refuse to call your algorithm real-time&lt;br/&gt;#OmarVsJaymin It wasn't a die yield problem. The 8th SPU disabled itself so that it would never have to run your code&lt;br/&gt;#OmarVsJaymin If we ever lose all the Amazon rainforest trees, we could replace them with your code. 
They have the same number of branches&lt;br/&gt;#OmarVsJaymin Sony removed OtherOS support from the PS3 roughly 3 minutes after receiving the PS3 Linux demo you sent with your application&lt;br/&gt;#OmarVsJaymin your code is so bad, not even Arizona would profile it&lt;br/&gt;#OmarVsJaymin your ps2 gfx are so poor that the GIF returns anything you xgkick with a note that says the GS doesn't live here anymore&lt;br/&gt;&lt;br/&gt;@aras_p&lt;br/&gt;#OmarVsJaymin Your code's so bad that applying OOP&amp;amp;OOD can actually improve it!&lt;br/&gt;#OmarVsJaymin your code is in GoF Design Patterns book as an example!&lt;br/&gt;#OmarVsJaymin I know it's a surprise for you, but it's 16 milliseconds, not megaseconds!&lt;br/&gt;#OmarVsJaymin You hope you'll have L2 cache miss for each variable access. Right now each access involves DVD seeks!&lt;br/&gt;#OmarVsJaymin Your code is so slow that rewriting it in PHP would make it faster!&lt;br/&gt;#OmarVsJaymin If anyone opened up your code, there would be more ash than from Eyjafjallajökull&lt;br/&gt;&lt;br/&gt;@dylancuthbert&lt;br/&gt;#OmarVsJaymin your code is so ugly the raid mirror cracked when you tried to save the file&lt;br/&gt;#OmarVsJaymin your coding methods are so backwards they've added it to the school curriculum in Texas!&lt;br/&gt;#OmarVsJaymin I've never seen a priest code before, I mean, you must be a priest right? Your code is running on pure faith and no logic..&lt;br/&gt;#OmarVsJaymin Your latencies are so large I can write not just one but several games in them&lt;br/&gt;#OmarVsJaymin I know this may be a shock but RAM access timings have nothing to do with how often you click on goatse&lt;br/&gt;#OmarVsJaymin your coding style is lauded as the next big thing.... 
at COBOL group get-togethers!!&lt;br/&gt;#OmarVsJaymin imagine an egg hitting a concrete pavement, that's how hard-wired, inflexible and brittle your code is&lt;br/&gt;#OmarVsJaymin your code is so wretched the cache invalidates as it loads it, and hard disks add it to their bad block lists&lt;br/&gt;#OmarVsJaymin in the future true sentient AI will be invented solely to stop you from ever touching a keyboard and soiling code again&lt;br/&gt;#OmarVsJaymin I never believed in chaos theory until I saw your variable naming convention&lt;br/&gt;&lt;br/&gt;@pat_wilson&lt;br/&gt;#OmarVsJaymin Your code is so bad, Richard Stallman suggested that you keep it proprietary.&lt;br/&gt;#OmarVsJaymin Your code runs so slow your data brings sleeping bags to camp-out in the cache lines.&lt;br/&gt;#OmarVsJaymin Your shader code is so terrible, PIX always suggests GPR allocations of -1.&lt;br/&gt;#OmarVsJaymin Daikatana's code makes your code it's bitch.&lt;br/&gt;&lt;br/&gt;@richard_a_sim&lt;br/&gt;#OmarVsJaymin Your code is so bloated that @okonomiyonda's brain wrapped to 0 while counting the cycles in its inner loop&lt;br/&gt;#OmarVsJaymin Your code is so inflexible that Sony is considering PLD's for the PS4, as it'll never get patched&lt;br/&gt;#OmarVsJaymin Clean, clear, and under control; three things that will never be said about your code&lt;br/&gt;&lt;br/&gt;@bkaradzic&lt;br/&gt;#OmarVsJaymin After Bjarne Stroustrup saw your code, he apologized for inventing C++.&lt;br/&gt;#OmarVsJaymin The best stairs you ever drew was done by using thread profiler.&lt;br/&gt;#OmarVsJaymin Your matrix template library got rejected by Boost committee because it was too generic for their taste.&lt;br/&gt;#OmarVsJaymin Your game code base just won IOCCC.&lt;br/&gt;&lt;br/&gt;@keyframe&lt;br/&gt;#OmarVsJaymin Your code is the reason why @mike_acton buys post-its in bulk wholesale lots!&lt;br/&gt;#OmarVsJaymin your code is so lame, you can encode mp3 with it&lt;br/&gt;#OmarVsJaymin Oh yeah? 
I hear your code is so slow, Adobe has been called in to optimize it!&lt;br/&gt;#OmarVsJaymin Javascript based scripts in your game are considered an optimization step&lt;br/&gt;#OmarVsJaymin By popular demand, your code backup is in /dev/null/&lt;br/&gt;#OmarVsJaymin What does your code have in common with C? No class.&lt;br/&gt;&lt;br/&gt;@pinskia&lt;br/&gt;#OmarVsJaymin Your code can only be compiled with a &amp;quot;C with classes&amp;quot; compiler.&lt;br/&gt;#OmarVsJaymin Your code needs Visual Studio to compile.&lt;br/&gt;#OmarVsJaymin omg you fired Pinski with that code?????&lt;br/&gt;#OmarVsJaymin your code needs so much help matt smith cannot help you.&lt;br/&gt;#OmarVsJaymin your code needs so much help that the California budget looks sane.&lt;br/&gt;#OmarVsJaymin your code is so bad it was recalled like grey davis.&lt;br/&gt;#OmarVsJaymin your code is do bad that people were questioning it if it was truely born in the us.&lt;br/&gt;&lt;br/&gt;@drewthaler&lt;br/&gt;#OmarVsJaymin Your code is so bad your child processes disowned you.&lt;br/&gt;#OmarVsJaymin Your framerate is like a bottle of sunscreen - it has to be measured in spf&lt;br/&gt;#OmarVsJaymin Steve Jobs called, he heard you were thinking about making an iPhone game and wanted to fast-track your app store rejection.&lt;br/&gt;#OmarVsJaymin Your code is so bad Fred Brooks went and found a silver bullet to shoot it with.&lt;br/&gt;</description>
      <enclosure url="http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/24_OmarVsJaymin__best_thing_on_Twitter_files/i-can-has-programming-language.jpg" length="52655" type="image/jpeg"/>
    </item>
    <item>
      <title>Pseudorandom PS3 Linux stuff</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/1_Pseudorandom_PS3_Linux_stuff.html</link>
      <guid isPermaLink="false">2a1bc8aa-82b7-4bf9-9af9-3464f7d4fdd6</guid>
      <pubDate>Sat, 1 May 2010 17:09:17 +0900</pubDate>
      <description>&lt;a href=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/1_Pseudorandom_PS3_Linux_stuff_files/DSC03696.jpg&quot;&gt;&lt;img src=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Media/object020_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:176px; height:136px;&quot;/&gt;&lt;/a&gt;So, I got a request from someone on Twitter ( yes Jim, it was you ) to talk about setting up a cross compiler to build PPU and SPU binaries on Mac OS X.  I’ll get to that later in the post but first I wanted to start out with a quick brain dump of the stuff I had to deal with previously.  &lt;br/&gt;&lt;br/&gt;	•	I am using a 60GB retail kit, most likely a launch unit&lt;br/&gt;	•	I installed Yellow Dog 6.2 and it was pretty problem-free.  Screw you Barcelona and IBM for only supporting Fedora&lt;br/&gt;	•	Make sure you type install-ps3-1080p at the kboot prompt. Only hi-def cables are supported for install&lt;br/&gt;	•	I have been using Linux on and off for 14 years, and this is the first time sound worked right out of the box! &lt;br/&gt;	•	Customize your install to also include KDE. It seems to be the only usable window manager, and is the fastest one&lt;br/&gt;	•	Include gnome and that other one as well.  If you don’t, for some odd reason it doesn’t install wicd&lt;br/&gt;	•	For the time being, it’s best not to update packages. I did a full update and wicd broke horribly ( edit: workaround available )&lt;br/&gt;	•	Go into Network and disable wlan0 and uncheck the “start on boot” option.  It’s best if Network has no knowledge of wlan0 whatsoever&lt;br/&gt;	•	Instead, use wicd.  
It seems to just work perfectly, even for WPA1/2, and it even has the option to activate wlan0 on boot&lt;br/&gt;	•	I forget the command, but you should edit some bootloader file to make sure you boot in full screen ( black bars by default )&lt;br/&gt;	•	You can safely use yum to update your gnu toolchain and get the newest PPU/SPU compilers&lt;br/&gt;	•	Be sure to get rlogin working without a password so you can use the makefile below&lt;br/&gt;	•	Use the Samba config GUI to set up Samba.  It gives you an easy way to copy files without using rcp and opening port 514.&lt;br/&gt;	•	More stuff to be added as I remember it...&lt;br/&gt;&lt;br/&gt;If you set up everything correctly, you should already have your toolchain and libspe installed.  I know this because my system tells me so.&lt;br/&gt;&lt;br/&gt;[okonomiyonda@localhost YellowDog]$ sudo rpm -i libspe*&lt;br/&gt;package libspe2-2.2.80-132.ydl6.1 is already installed&lt;br/&gt;package libspe-2.2.80-132.ydl6.1 is already installed&lt;br/&gt;package libspe2-devel-2.2.80-132.ydl6.1 is already installed&lt;br/&gt;package libspe-devel-2.2.80-132.ydl6.1 is already installed&lt;br/&gt;&lt;br/&gt;[okonomiyonda@localhost YellowDog]$ spu-gcc --version&lt;br/&gt;spu-gcc (GCC) 4.1.1&lt;br/&gt;&lt;br/&gt;[okonomiyonda@localhost YellowDog]$ ppu-gcc --version&lt;br/&gt;ppu-gcc (GCC) 4.1.1&lt;br/&gt;&lt;br/&gt;That settles that.  There is a newer 3.x version of the library but at the moment it seems like it requires some black magic to make it work on a non-Fedora system.  Admittedly I only looked for about 5 minutes so there could already be a YellowDog rpm out there somewhere.&lt;br/&gt;&lt;br/&gt;Finally, I get to the reason for this post: cross compiling.  I have a few reasons for wanting to do so.  First of all, once my fat PS3 dies then it’s game over for my home PS3 dev life, so it’s nice to not have to have the PS3 turned on just so I can compile and link.  Also, it’s good for my electricity bill.  
Also, my 16 core 32GB RAM Mac Pro can probably compile faster than the PS3 :)  &lt;br/&gt;&lt;br/&gt;This used to contain step by step instructions for building a “working” cross compiler on my Mac, until I noticed some of the steps I had to take involved editing the generated stdint.h to wrap it with #ifndef ASSEMBLY in the exact time frame after the file is generated but before it starts getting -include’d in .S assembly files.  So yeah, something more fundamental was wrong with my build environment.  On @groby’s advice, I ended up just installing VirtualBox, Fedora 12, and following @Mike_Acton’s excellent cross compiler instructions&lt;br/&gt;&lt;br/&gt;&lt;a href=&quot;http://cellperformance.beyond3d.com/articles/2006/11/cross-compiling-for-ps3-linux-1.html&quot;&gt;http://cellperformance.beyond3d.com/articles/2006/11/cross-compiling-for-ps3-linux-1.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;To get you started, I am including a stripped down version of the makefile I use at the core of my build system.  Why stripped down?  Because you don’t have my target manager and you don’t have my job system.  I will probably try to compartmentalize some stuff and repost a new version of the makefile that compiles SPU jobs for both SPU and PPU, allowing you to switch between them at runtime.&lt;br/&gt;&lt;br/&gt;First, a few warnings.  I am no makefile expert, so I am probably doing things very inefficiently.  I will probably rewrite it at some point.  Here is what your directory structure should look like ( locally on your VirtualBox Linux box )&lt;br/&gt;&lt;br/&gt;directories that need to exist:&lt;br/&gt;&lt;br/&gt;SomeProj                   - project root. 
Contains final ppu elf&lt;br/&gt;SomeProj/ppu               - all ppu C/C++/ASM files&lt;br/&gt;SomeProj/spu/SomeJob1      - all C/C++/ASM files for SPU job SomeJob1&lt;br/&gt;SomeProj/spu/SomeJob2      - all C/C++/ASM files for SPU job SomeJob2&lt;br/&gt;SomeProj/data-src          - all your source data, and the makefile needed to process it&lt;br/&gt;SomeProj/data-ps3          - where your processed data goes&lt;br/&gt;&lt;br/&gt;Here are some useful targets to build:&lt;br/&gt;&lt;br/&gt;build-all: builds all code &lt;br/&gt;data: makes data in data-src and builds it into data-ps3&lt;br/&gt;install: copies everything to the target PS3&lt;br/&gt;copy-bins: copies only binaries to the PS3&lt;br/&gt;copy-data: copies only data to the PS3&lt;br/&gt;run: runs your program on the PS3&lt;br/&gt;clean: clean non-data&lt;br/&gt;clean-data: clean data&lt;br/&gt;&lt;br/&gt;Probably the only thing you will have to struggle with is making rlogin work without having to manually enter the password.  It’s not that hard and there are instructions on how to do it on the internets.  Lemme know if you have problems.  See the first 2 makefile sections for customizing things for your system.  Now you are ready to enjoy programming without having to turn your PS3 on every time, and without having to manually log in and copy things every single time.&lt;br/&gt;&lt;br/&gt;###############################################################################&lt;br/&gt;# customize me for your system&lt;br/&gt;###############################################################################&lt;br/&gt;CELL_ROOT=/opt/cell&lt;br/&gt;CELL_BIN=$(CELL_ROOT)/bin&lt;br/&gt;# on my system, PS3_IPADDR is an environment variable set by my Targit Manajur.  &lt;br/&gt;# I am hardcoding it here so other people can use this.  
Note that for the &lt;br/&gt;# copy targets to work, you have to set up rlogin so that rcp and rsh can be &lt;br/&gt;# invoked without users needing to enter passwords&lt;br/&gt;PS3_IPADDR=10.0.1.6&lt;br/&gt;PS3_ACCOUNT=okonomiyonda&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;###############################################################################&lt;br/&gt;#customize me for your project. We *could* deduce this from the current dir&lt;br/&gt;###############################################################################&lt;br/&gt;ELFNAME=hello_world_ppu&lt;br/&gt;REMOTE_PROJECT_DIR=../PS3Projects/HelloWorld&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;###############################################################################&lt;br/&gt;# everything below here should be left alone&lt;br/&gt;###############################################################################&lt;br/&gt;&lt;br/&gt;# I failed to build objs into a separate directory AND make dependencies work :(&lt;br/&gt;# so I am going with working dependencies &lt;br/&gt;#OBJDIR=objs&lt;br/&gt;OBJDIR=.&lt;br/&gt;PPU_OBJDIR=$(OBJDIR)/ppu&lt;br/&gt;SPU_OBJDIR=$(OBJDIR)/spu&lt;br/&gt;&lt;br/&gt;#for cleaning.  
I feel like this is the absolute slowest way to do this&lt;br/&gt;CLEANFILES=\&lt;br/&gt;${shell find ./ -name &amp;quot;*.d&amp;quot; -type f -print -nowarn} \&lt;br/&gt;${shell find ppu/ -name &amp;quot;*.o&amp;quot; -type f -print -nowarn} \&lt;br/&gt;${shell find spu/ -name &amp;quot;*.o&amp;quot; -type f -print -nowarn} \&lt;br/&gt;${shell find spu/ -name &amp;quot;*.elf&amp;quot; -type f -print -nowarn} \&lt;br/&gt;$(ELFNAME)&lt;br/&gt;&lt;br/&gt;# compilers and flags&lt;br/&gt;PPU_GCC=$(CELL_BIN)/ppu-g++&lt;br/&gt;SPU_GCC=$(CELL_BIN)/spu-g++&lt;br/&gt;PPU_AS=$(CELL_BIN)/ppu-as&lt;br/&gt;SPU_AS=$(CELL_BIN)/spu-as&lt;br/&gt;EMBEDSPU=$(CELL_BIN)/ppu-embedspu&lt;br/&gt;PPU_CXXFLAGS=-I$(CELL_ROOT)/sysroot/usr/include&lt;br/&gt;SPU_CXXFLAGS=-I$(CELL_ROOT)/spu/include&lt;br/&gt;PPU_LDFLAGS= -lspe2 -m64&lt;br/&gt;EMBEDSPU_FLAGS=-m64&lt;br/&gt;&lt;br/&gt;# PPU sources&lt;br/&gt;PPU_CXX_SOURCE=${shell find ppu/ -name &amp;quot;*.cpp&amp;quot; -type f -print -nowarn}&lt;br/&gt;PPU_ASM_SOURCE=${shell find ppu/ -name &amp;quot;*.s&amp;quot; -type f -print -nowarn}&lt;br/&gt;PPU_CXX_OBJS=$(patsubst %.cpp,%.o,$(PPU_CXX_SOURCE))&lt;br/&gt;PPU_ASM_OBJS=$(patsubst %.s,%.o,$(PPU_ASM_SOURCE))&lt;br/&gt;PPU_CXX_DEPFILES=$(addprefix $(OBJDIR)/, $(patsubst %.cpp,%.d, $(PPU_CXX_SOURCE)))&lt;br/&gt;PPU_ASM_DEPFILES=$(addprefix $(OBJDIR)/, $(patsubst %.s,%.d, $(PPU_ASM_SOURCE)))&lt;br/&gt;PPU_DEPFILES=$(PPU_CXX_DEPFILES) $(PPU_ASM_DEPFILES) &lt;br/&gt;&lt;br/&gt;# SPU sources&lt;br/&gt;SPU_JOB_DIRS=${shell find spu/* -type d -print}&lt;br/&gt;GET_LAST_DIR=$(lastword $(subst /, ,$(1)))&lt;br/&gt;# force the SPU jobs to relink every time by making them depend on non-existent fle files&lt;br/&gt;SPU_JOB_ELFS=$(addprefix $(OBJDIR)/,$(foreach dir,$(SPU_JOB_DIRS),$(addsuffix /$(call GET_LAST_DIR,$(dir)).fle,$(dir))))&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;SPU_JOBS_CXX_SOURCE=${shell find spu/ -name &amp;quot;*.cpp&amp;quot; -type f -print -nowarn}&lt;br/&gt;SPU_JOBS_ASM_SOURCE=${shell find spu/ 
-name &amp;quot;*.s&amp;quot; -type f -print -nowarn}&lt;br/&gt;SPU_JOBS_CXX_OBJS=$(patsubst %.cpp,%.o,$(SPU_JOBS_CXX_SOURCE))&lt;br/&gt;SPU_JOBS_ASM_OBJS=$(patsubst %.s,%.o,$(SPU_JOBS_ASM_SOURCE))&lt;br/&gt;SPU_CXX_DEPFILES=$(addprefix $(OBJDIR)/, $(patsubst %.cpp,%.d, $(SPU_JOBS_CXX_SOURCE))) &lt;br/&gt;SPU_ASM_DEPFILES=$(addprefix $(OBJDIR)/, $(patsubst %.s,%.d, $(SPU_JOBS_ASM_SOURCE)))&lt;br/&gt;SPU_DEPFILES=$(SPU_CXX_DEPFILES) $(SPU_ASM_DEPFILES)&lt;br/&gt;&lt;br/&gt;# for copying stuff in copy targets&lt;br/&gt;DEST_DATA_TOP_DIR=data-ps3&lt;br/&gt;#DATA_DIRS_TO_CREATE=${addprefix $(REMOTE_PROJECT_DIR)/, ${shell find $(DEST_DATA_TOP_DIR) -name &amp;quot;*&amp;quot; -type d -print -nowarn}}&lt;br/&gt;BINS_TO_COPY=$(ELFNAME) $(SPU_JOB_ELFS)&lt;br/&gt;BIN_DIRS_TO_CREATE=$(REMOTE_PROJECT_DIR) ${addprefix $(REMOTE_PROJECT_DIR)/, $(dir $(SPU_JOB_ELFS))}&lt;br/&gt;# HACKFEST! no longer used because we now embed our SPU jobs&lt;br/&gt;HACK_BIN_COPY_COMMAND=$(foreach bin,$(BINS_TO_COPY),rcp $(bin) $(PS3_ACCOUNT)@$(PS3_IPADDR):$(REMOTE_PROJECT_DIR)/$(bin) ; )&lt;br/&gt;SPU_BIN_IMAGE_PREFIX=spu_bin_image_prefix_&lt;br/&gt;&lt;br/&gt;#warning... this is broken in gcc 4.1.1.  
The -MT switch screws up so everything is always rebuilt&lt;br/&gt;DEPFILES=$(SPU_DEPFILES) $(PPU_DEPFILES)&lt;br/&gt;&lt;br/&gt;NODEPS:=clean&lt;br/&gt;&lt;br/&gt;all: $(SPU_JOBS_CXX_OBJS) $(SPU_JOBS_ASM_OBJS) $(PPU_CXX_OBJS) $(PPU_ASM_OBJS) $(SPU_JOB_ELFS)&lt;br/&gt;    $(PPU_GCC) $(PPU_LDFLAGS) -o $(ELFNAME) $(PPU_OBJDIR)/*.o&lt;br/&gt;    @echo good lookin on this compile, B!&lt;br/&gt;&lt;br/&gt;debug-deps: &lt;br/&gt;    @echo dependency files: $(DEPFILES)&lt;br/&gt;&lt;br/&gt;run:&lt;br/&gt;    rsh -l $(PS3_ACCOUNT) $(PS3_IPADDR)  $(REMOTE_PROJECT_DIR)/$(ELFNAME)&lt;br/&gt;&lt;br/&gt;install: make-project-dir copy-data copy-bins&lt;br/&gt;&lt;br/&gt;make-project-dir:&lt;br/&gt;    @echo creating project directory...&lt;br/&gt;    rsh -l $(PS3_ACCOUNT) $(PS3_IPADDR) mkdir -p $(REMOTE_PROJECT_DIR)&lt;br/&gt;&lt;br/&gt;copy-data:&lt;br/&gt;    @echo copying data...&lt;br/&gt;    rcp -r $(DEST_DATA_TOP_DIR) $(PS3_ACCOUNT)@$(PS3_IPADDR):$(REMOTE_PROJECT_DIR)/&lt;br/&gt;&lt;br/&gt;copy-bins:&lt;br/&gt;    @echo creating bin directories $(BIN_DIRS_TO_CREATE)&lt;br/&gt;    rsh -l $(PS3_ACCOUNT) $(PS3_IPADDR) mkdir -p $(BIN_DIRS_TO_CREATE)&lt;br/&gt;    # spu jobs are embedded now so no need to copy the elf files&lt;br/&gt;    rcp $(ELFNAME) $(PS3_ACCOUNT)@$(PS3_IPADDR):$(REMOTE_PROJECT_DIR)/$(ELFNAME)&lt;br/&gt;&lt;br/&gt;data:&lt;br/&gt;    make -C data-src/&lt;br/&gt;&lt;br/&gt;# depends on non existent fle files to force this target to run every time&lt;br/&gt;$(SPU_JOB_ELFS):&lt;br/&gt;    $(SPU_GCC) $(SPU_LDFLAGS) -o $(subst .fle,.elf,$@) $(dir $@)*.o&lt;br/&gt;    $(EMBEDSPU) $(EMBEDSPU_FLAGS) $(SPU_BIN_IMAGE_PREFIX)$(subst .fle,,$(notdir $@)) $(subst .fle,.elf,$@) $(PPU_OBJDIR)/$(notdir $(subst .fle,.elf,$@)).o&lt;br/&gt;&lt;br/&gt;$(SPU_JOBS_CXX_OBJS): %.o: %.cpp&lt;br/&gt;    mkdir -p $(OBJDIR)/$(dir $(@))&lt;br/&gt;    $(SPU_GCC) $(SPU_CXXFLAGS) -MM -MT $@ $&amp;lt; | sed s/objs[a-zA-Z0-9/]*.o// &gt; $(OBJDIR)/$(subst .o,.d,$@)&lt;br/&gt;    $(SPU_GCC) 
$(SPU_CXXFLAGS) -c -o $(OBJDIR)/$@ $&amp;lt;&lt;br/&gt;&lt;br/&gt;$(SPU_JOBS_ASM_OBJS): %.o: %.s&lt;br/&gt;    mkdir -p $(OBJDIR)/$(dir $(@))&lt;br/&gt;    $(SPU_AS) $(SPU_ASFLAGS) -o $(OBJDIR)/$@ $&amp;lt;&lt;br/&gt;&lt;br/&gt;$(PPU_CXX_OBJS): %.o: %.cpp&lt;br/&gt;    mkdir -p $(OBJDIR)/$(dir $(@))&lt;br/&gt;    $(PPU_GCC) $(PPU_CXXFLAGS) -MM -MT $@ $&amp;lt; | sed s/objs[a-zA-Z0-9/]*.o// &gt; $(OBJDIR)/$(subst .o,.d,$@)&lt;br/&gt;    $(PPU_GCC) $(PPU_CXXFLAGS) -c -o $(OBJDIR)/$@ $&amp;lt;&lt;br/&gt;&lt;br/&gt;$(PPU_ASM_OBJS): %.o: %.s&lt;br/&gt;    mkdir -p $(OBJDIR)/$(dir $(@))&lt;br/&gt;    $(PPU_AS) $(PPU_ASFLAGS) -o $(OBJDIR)/$@ $&amp;lt;&lt;br/&gt;&lt;br/&gt;clean:&lt;br/&gt;    rm -rf $(CLEANFILES)&lt;br/&gt;&lt;br/&gt;clean-data:&lt;br/&gt;    make -C data-src/ clean&lt;br/&gt;&lt;br/&gt;ifeq (0, $(words $(findstring $(MAKECMDGOALS), $(NODEPS))))&lt;br/&gt;-include $(DEPFILES)&lt;br/&gt;#$(warning including depfiles $(DEPFILES))&lt;br/&gt;endif&lt;br/&gt;</description>
      <enclosure url="http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/5/1_Pseudorandom_PS3_Linux_stuff_files/DSC03696.jpg" length="61470" type="image/jpeg"/>
    </item>
    <item>
      <title>n00b tip: Wont you be my neighbor?</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Wont_you_be_my_neighbor.html</link>
      <guid isPermaLink="false">f1a4599d-7501-4310-97ce-7e33eac42909</guid>
      <pubDate>Sat, 24 Apr 2010 16:37:14 +0900</pubDate>
      <description>&lt;a href=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Wont_you_be_my_neighbor_files/rogers.jpg&quot;&gt;&lt;img src=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Media/object071_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:176px; height:136px;&quot;/&gt;&lt;/a&gt;Disclaimer: I mean no disrespect towards Fred McFeely Rogers.  I grew up watching his show and he taught me all about the patience and tolerance that to this day I still don’t have.  I only chose the photo because this post is all about...  wait for it... neighbors!&lt;br/&gt;&lt;br/&gt;A while back I started thinking about ways to average the 8 pixels neighboring some pixel.  I like this problem a lot because it’s quite basic/fundamental, but at the same time potentially has a number of interesting solutions.  If you’ve been doing SPU programming for any length of time, everything I say here will be completely obvious to you.  But for people new to the SPUs, hopefully it will keep them from writing some C++ SPU job that will probably work but isn’t anything you want iterating over millions of pixels. Or maybe it is.  Depends on your personality, I guess.&lt;br/&gt;&lt;br/&gt;Let’s assume we don’t want to be extracting scalars from vectors or doing things one at a time.  Step one is to load the pixels in some way that makes life easy.  Notice the following:&lt;br/&gt;&lt;br/&gt;                               original vec: { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }&lt;br/&gt;original vec rotated right 1 byte: { 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 }&lt;br/&gt;  original vec rotated left 1 byte: { 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16 }&lt;br/&gt;&lt;br/&gt;With the exception of the bytes on the end, we now have each pixel vertically aligned with its left and right neighbor, meaning we can process 16 pixels at a time.  
To fix up the ends, you can use shufb to not only simulate the effect of vector rotation, but also to steal a byte from the previous and next vectors.  Your left and right shuffle masks will look like this&lt;br/&gt;&lt;br/&gt;  left neighbor mask: { 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 }&lt;br/&gt;right neighbor mask: { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }&lt;br/&gt;&lt;br/&gt;I won’t go into the finer details of SHUFB, but that first mask takes the last byte of vector 1 and the first 15 bytes of vector 2.  The second mask takes the last 15 bytes of vector 1 and the first byte of vector 2.  In other words, exactly what we want.  Now all we have to do is add.&lt;br/&gt;&lt;br/&gt;The clever among you will notice two things wrong with that last statement.  First of all, 255 + 255 + 255 = 253 ( because the result is stored back to a byte ).  Unfortunately, in this case we can’t do any obvious overflow voodoo magic with CG ( carry generate instruction, not the NVidia shader language ) because CG only works on int and unsigned int vectors.  &lt;br/&gt;&lt;br/&gt;The other problem is the SPU has no byte addition instruction.  If you don’t believe me, check out the code below.  If you use operator+ on two unsigned char vectors, you get the following lovely piece of code&lt;br/&gt;&lt;br/&gt;lqr     $21,0x2df40&lt;br/&gt;lqr     $20,0x2df00&lt;br/&gt;andhi   $3,$21,-256     &lt;br/&gt;ah      $19,$20,$21     &lt;br/&gt;ah      $17,$3,$20&lt;br/&gt;selb    $16,$17,$19,$18&lt;br/&gt;&lt;br/&gt;That’s not to say that there are no instructions that add bytes together.  There are, but they don’t work like you think.  SUMB takes every 4 consecutive bytes from vector 1, adds them together, and puts the sums into the even elements of the destination short vec.  SUMB also takes every 4 consecutive bytes from vector 2, adds them together, and puts the sums into the odd elements of the destination short vec.  
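&lt;br/&gt;&lt;br/&gt;If that is hard to picture, here is a rough scalar C sketch of SUMB as described above.  It is only an illustration and not real SPU code: the name sumb_sketch and the vector 1 / vector 2 argument order are assumptions made for this example, so check the real operand order against the ISA manual before relying on it.&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;
```c
/* Scalar sketch of SUMB as described in the text (hypothetical helper).
 * v1, v2: the two 16 byte input vectors.
 * out:    the destination viewed as 8 unsigned shorts. */
void sumb_sketch(const unsigned char v1[16],
                 const unsigned char v2[16],
                 unsigned short out[8])
{
    int w, i;
    for (w = 0; w != 4; ++w) {          /* one pass per 32 bit word */
        unsigned short s1 = 0;
        unsigned short s2 = 0;
        for (i = 0; i != 4; ++i) {      /* the 4 consecutive bytes of that word */
            s1 = (unsigned short)(s1 + v1[4 * w + i]);
            s2 = (unsigned short)(s2 + v2[4 * w + i]);
        }
        out[2 * w] = s1;                /* even element: sum from vector 1 */
        out[2 * w + 1] = s2;            /* odd element: sum from vector 2 */
    }
}
```
&lt;/pre&gt;Each sum is at most 4 * 255 = 1020, which is why the results need shorts rather than bytes.&lt;br/&gt;&lt;br/&gt;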
You do add 8 numbers simultaneously, but you still have to shift and you still have to do a lot of weird packing/unpacking.  I haven’t totally given up on the idea of using this instruction, but I have temporarily moved on to trying other things.  I still think there is good potential here!&lt;br/&gt;&lt;br/&gt;edit: Using two shuffles you can get the data in a format where you can use SUMB to sum up all neighbors at once into shorts.  However, since the two operands to the shuffle need to be the vector and the vector below it, it doesn’t work so well for pixel windows that span horizontally across loaded vectors.  The problem is fixable but it would require an additional shuffle that depends on the result of the previous shuffle, or some rotates and selects.  For this reason, I’m not sure it’s the best way to go, but I’m not that clever so I could be missing something.&lt;br/&gt;&lt;br/&gt;One way to “solve” the problem is to realize that because we will be averaging 8 items, the weight for each item is 1/8, or item &gt;&gt; 3.  However, consider the scenario where you have a pixel surrounded by all 7’s.  In non-float land, 7/8 = 0, therefore making the average of all the pixels in the neighborhood 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 0.  Maybe that’s OK if you know your data will always be &gt;= 8, but we still don’t have a fast way to sum up all those weighted pixels.  The obvious thing to do is quickly convert to short, add and shift, and put the result back in bytes but that seems excessive.  Oh well, if all else fails, we can always fall back to doing it that way. &lt;br/&gt;&lt;br/&gt;There is another interesting instruction called AVGB, meaning average bytes.  From the name alone, you know this one is going to be damn useful.  It takes 2 vectors as operands, adds corresponding elements plus 1, and then shifts right by 1.   For example, let’s say the first elements of the 2 input vectors are 11 and 20.  AVGB will give us ( 11 + 20 + 1 ) &gt;&gt; 1 or 16.  
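&lt;br/&gt;&lt;br/&gt;Here is a rough scalar C sketch of what AVGB does to each byte.  It is an illustration only, not SPU code, and the names avgb_lane and avgb_sketch are made up for this example.&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;
```c
/* One byte lane of AVGB: add, add 1, halve (hypothetical helper).
 * The sum is widened to unsigned int first so it cannot wrap at 255. */
unsigned char avgb_lane(unsigned char a, unsigned char b)
{
    unsigned int sum = (unsigned int)a + (unsigned int)b + 1u;
    return (unsigned char)(sum / 2u);   /* same as shifting right by 1 */
}

/* All 16 lanes of a vector, one after another. */
void avgb_sketch(const unsigned char a[16],
                 const unsigned char b[16],
                 unsigned char out[16])
{
    int i;
    for (i = 0; i != 16; ++i) {
        out[i] = avgb_lane(a[i], b[i]);
    }
}
```
&lt;/pre&gt;With the inputs above, avgb_lane( 11, 20 ) gives 16, and because the sum is widened before halving, 255 and 255 average to 255 instead of wrapping.&lt;br/&gt;&lt;br/&gt;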
Because the number of pixels in the neighborhood is a power of two ( 8 )&lt;br/&gt;&lt;br/&gt;0  1  2&lt;br/&gt;3  P  4&lt;br/&gt;5  6  7&lt;br/&gt;&lt;br/&gt;we can express the average as avg( avg( avg( 0, 1 ), avg( 2, 3 ) ), avg( avg( 4, 5 ), avg( 6, 7 ) ) ).  So that’s it, with a few shuffle masks and some AVGB instructions, we can average neighbors for 16 pixels at once, and most importantly we can do it in a way that minimizes loads and random accesses.  &lt;br/&gt;&lt;br/&gt;I make no guarantees that this is the optimal way to do anything.  It all depends on your data and exactly what your needs are.  Maybe the precision is problematic for your use.  I am currently toying around with some alternative ideas that may or may not pan out.  If anything cool comes out of it, I’ll be sure to post.  Also, feel free to leave your thoughts and ideas in the comments section below.  I’d be curious what other techniques some of the more creative people in the industry are using.  &lt;br/&gt;&lt;br/&gt;One last thing to note: don’t forget to reuse your loaded vectors!  To process a particular vector, you also need the vector that comes before it and the one that comes after it.  On your next iteration, the current vector becomes previous, and the next vector becomes current.  You can even hide the cost of loading the next next vector somewhere in the previous iteration, giving you completely free loads ( your mileage may vary, depends on what else you do in your loop ).&lt;br/&gt;</description>
      <enclosure url="http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Wont_you_be_my_neighbor_files/rogers.jpg" length="39062" type="image/jpeg"/>
    </item>
    <item>
      <title>n00b tip: Psychic computing? ( part 2 )</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Psychic_computing_%28_part_2_%29.html</link>
      <guid isPermaLink="false">0fc62fc5-b383-40f7-a317-8cdfdfdfb424</guid>
      <pubDate>Sat, 24 Apr 2010 13:36:07 +0900</pubDate>
      <description>&lt;a href=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Psychic_computing_%28_part_2_%29_files/miss_cleo.jpg&quot;&gt;&lt;img src=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Media/object059_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:218px; height:165px;&quot;/&gt;&lt;/a&gt;Warning: this one is a little weird.  I can’t guarantee it will ever be useful for anything as-is, but maybe it will give you some good ideas.  Since this one is more about the overall idea than specific implementation, it’s going to be a little light on code examples.  Just know it’s about 95% shuffles and masks.&lt;br/&gt;&lt;br/&gt;Let’s say you previously ran an edge detector on some image.   Now you want to go through rows and columns in the edge map to find line segments and their lengths.  Maybe you even want to make a list of these line segments.  One obvious way to do this would be to fill an unsigned int vector with the pixel values and rotate the vector to process them one by one, updating your state ( looking for line start, start new line, continue current line, end current line ).  Keeping with the theme of updating everything at once rather than one at a time, I’d like to show you an idea for something that may work better in some cases.&lt;br/&gt;&lt;br/&gt;It’s going to work via a lookup table.  Two of the best things about the SPU’s LS are the fast access times, and the fact that memory accesses don’t interfere with the memory accesses of other processors the way they would on a shared memory system.  In fact, lookup tables are so much fun on the SPUs, that most of the really good VRAM detiling SPU implementations I have seen use them.  If that’s not enough proof for you, then I don’t know what is!&lt;br/&gt;&lt;br/&gt;The basic concept is that our algorithm needs to update a bunch of states.  All this info can be crammed into a vector, accessed in one load, and then unpacked.  
Assuming we process 4 pixels at a time and only one bit per pixel is needed for the index, there can be at most 16 table entries ( think 1111 in binary ).  However, we also need one bit from the previous iteration, giving us 11111 or 32 entries.  Not so bad at all!&lt;br/&gt;&lt;br/&gt;So what are the possible patterns we could see in a vector?  Assuming that 1 means edge and 0 means no edge, here are a few possibilities:&lt;br/&gt;&lt;br/&gt;prev vector    current vector        meaning&lt;br/&gt;     ???1                1111               a line started in a prev vector is continued&lt;br/&gt;     ???1                1110               a line started in a prev vector is continued, but then ended&lt;br/&gt;     ???0                1111               a new line is started in this vector&lt;br/&gt;     ???0                0110               a new line in this vector, and then ended&lt;br/&gt;     ???0                1010               two new lines are started, and both are ended&lt;br/&gt;&lt;br/&gt;From the above table, it’s easy to see why our table index needs one bit from the previous vector.  Without it, you wouldn’t know if a leading 1 was starting a new line or continuing a previous line.  Also from the above table it’s easy to see that all you need to do to get your table index is load in a vector from your edge map, do a GB ( gather bits ) to take the LSB from every word, and OR it with the extra bit corresponding to how the previous vector finished.  Obviously that extra bit should be initialized to zero at the start of every row or column since there is nothing to continue.&lt;br/&gt;&lt;br/&gt;The table entry format I chose was a vector of 8 unsigned shorts.  
&lt;br/&gt;&lt;br/&gt;0) whether a line that was started in a previous vector ends in this vector ( 0xFFFF or 0x0000 )&lt;br/&gt;1) if the above is true, how much do we increment the line's end position before writing it out&lt;br/&gt;2) if we start a line in this vector that isn’t ended here, its known start position&lt;br/&gt;3) if we start a line in this vector that isn’t ended here, increment its potential end position&lt;br/&gt;4) each vector can have up to 2 lines that both start and end here.  This is the first line's start&lt;br/&gt;5) the end position of the line started in #4&lt;br/&gt;6) same as #4 but for the second line&lt;br/&gt;7) same as #5 but for the second line&lt;br/&gt;&lt;br/&gt;Items 4 ~ 7 are there because you can start and end a maximum of two single pixel lines within one vector of 4 pixels.  Using all the info packed into the above table, we have everything we need to update the list of lines and the state of the line detector itself.  After a few shuffles and selects, we can do predicated writes of the line start and end position to either a valid list address or some junk stack address, and we know how many lines to update the line count by.  If we started a line, we know to write it somewhere useful or if a line is being continued, we know to keep the original start pos.  It’s all branchless and can be easily scheduled for nice results.&lt;br/&gt;&lt;br/&gt;So that’s the overall description of what I did.  The actual code is divided up into 5 parts.  First we build the table index and get the vector at that position.  Next we use a bunch of shuffles and selects to extract some useful info from the loaded vector.  This includes all kinds of masks and magic values that I use to try and minimize the amount of work needed to be done in the inner loop.  Next, if we have any lines that finished ( not including the small 1 or 2 pixel lines handled by 4 ~ 7 ) we write them out to the line list.  
Only after that do we then process the small lines.  Finally, we start any new lines if the current vector ends with an edge.  This nicely wraps the iteration up because it’s that last line start that becomes the extra bit for the next iteration’s lookup table index.</description>
      <enclosure url="http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/24_n00b_tip__Psychic_computing_%28_part_2_%29_files/miss_cleo.jpg" length="38957" type="image/jpeg"/>
    </item>
    <item>
      <title>n00b tip: Psychic computing? ( part 1 )</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/17_n00b_tip__Psychic_computing.html</link>
      <guid isPermaLink="false">068c317b-4e1e-456c-adb1-5af1ce09e314</guid>
      <pubDate>Sat, 17 Apr 2010 18:21:11 +0900</pubDate>
      <description>&lt;a href=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/17_n00b_tip__Psychic_computing_files/backtothefuturefiretrails.jpg&quot;&gt;&lt;img src=&quot;http://6cycles.maisonikkoku.com/6Cycles/6cycles/Media/object017_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:176px; height:132px;&quot;/&gt;&lt;/a&gt;Have you ever found yourself in a situation where you want to go through a row of pixels and stop when you find a certain value? Maybe you want to count the number of pixels in the row until you hit that special value.  However, you quickly find that those powerful vector instructions don’t help so much when you have to process one pixel at a time.  After all, how can you know how to process pixel N + 1 until you have processed pixel N? The SPUs are great but they’re not psychic... or are they?&lt;br/&gt;&lt;br/&gt;OK, so you can’t *exactly* see the future, but I can show you one possible way to process all your pixels at once in a branch-free way, and maybe introduce you to some interesting instructions you may not have used before.  So let’s start with the important part of your inner loop.&lt;br/&gt;&lt;br/&gt;clgtbi $6, $35, 0&lt;br/&gt;gbb $29, $6&lt;br/&gt;shli $6, $29, 16&lt;br/&gt;clz $73, $6&lt;br/&gt;a $69, $73, $45&lt;br/&gt;clgt $27, $49, $73&lt;br/&gt;selb $45, $69, $45, $40&lt;br/&gt;or $40, $27, $40&lt;br/&gt;&lt;br/&gt;What we will do first is compare all 16 pixels loaded in the vector with the immediate value we’re looking for ( in this case zero ).  To be honest, only the first 4 pixels in $35 are real.  The other 12 pixels in the vector are set to all 1 for reasons you’ll see later.&lt;br/&gt;&lt;br/&gt;clgtbi $6, $35, 0&lt;br/&gt;&lt;br/&gt;The result of that compare is a vector of 16 bytes, where each byte is either 0xFF if we found a zero pixel, or 0x00 otherwise.  
The final 12 bytes will always be 0xFF.&lt;br/&gt;&lt;br/&gt;gbb $29, $6&lt;br/&gt;&lt;br/&gt;This is one of my favorite and most ( over ) used instructions.  It takes the least significant bit of every byte in a vector and gathers them into a 16 bit number.  That number is then stored in element 0 of an unsigned int vector.  The upper 16 bits of that word are all zero.&lt;br/&gt;&lt;br/&gt;shli $6, $29, 16&lt;br/&gt;&lt;br/&gt;This shifts each word in the vector left by 16 bits ( the shift happens within each element, not across the whole vector ).  It’s taking that 16 bit value we just calculated and getting rid of the leading 16 zeros in the word.&lt;br/&gt;&lt;br/&gt;clz $73, $6&lt;br/&gt;&lt;br/&gt;It’s a good thing we got rid of those annoying 16 bits because we just counted the leading zeros in the word.  This becomes the count of how many pixels in this particular vector until you hit the value you are looking for.  Remember how I said we stuffed 1,1,1,1,1,1,1,1,1,1,1,1 into the unused 12 bytes of the pixel vector? Here is why.  If the rest of the vector had zeros in it, the leading bit count could have gone on way past where it should have.  Stuffing some non-zero value in means clz stops as soon as we get past the data we are really interested in.  Also, as a result, if the count is equal to 4, then we know the 4 pixels didn’t contain the value we were looking for.  &lt;br/&gt;&lt;br/&gt;For example, if we had pixel values { 0x86, 0x75, 0x00, 0x99, ... }, our compare mask in $6 would be { 0x00, 0x00, 0xFF, 0x00, ... } and the gathered bits would be 0010.  Counting the leading zeroes gives us 2, meaning that we got through 2 pixels before finding that zero pixel we were looking for. &lt;br/&gt;&lt;br/&gt;a $69, $73, $45&lt;br/&gt;&lt;br/&gt;Add the count for this vector to the count for the previously processed vector and store it somewhere temporary.  
It’s very important that we don’t overwrite the previous count because we are not sure if we want to keep the new count yet.&lt;br/&gt;&lt;br/&gt;clgt $27, $49, $73&lt;br/&gt;&lt;br/&gt;At this point $49 should have { 4, 0, 0, 0 } in it.  This is doing what we said before by comparing with 4.  If the leading zero count is equal to 4, then the value we are looking for isn’t in the 4 pixels we looked at.  If it’s less than 4 then it becomes the number of pixels in the vector before finding the zero.  &lt;br/&gt;&lt;br/&gt;selb $45, $69, $45, $40&lt;br/&gt;&lt;br/&gt;$40 is a mask indicating whether or not we previously ( in a previous iteration ) found a zero pixel.  If it’s all 0xFF, then it means that we found the pixel we were looking for in some previous vector and therefore should not be changing the count.  If it’s all 0x00, then we are still looking for the zero pixel.  That’s why we are using this mask to either select the previous count or the new count.  Once we find that zero pixel, we never want to update the count again.&lt;br/&gt;&lt;br/&gt;or $40, $27, $40&lt;br/&gt;&lt;br/&gt;Take the old mask and the new mask and or them together.  This is saying that if we found the zero pixel in a previous iteration or if we found it on this iteration, then this mask will stop us from ever updating the count again.&lt;br/&gt;&lt;br/&gt;So, that’s it.  Is it optimal?  Maybe yes in some cases, no in others.  Is there room for improvement?  Sure, of course!  Stay tuned for part two when I talk about taking advantage of LS and using lookup tables.&lt;br/&gt;&lt;br/&gt;edit: Why did I only process 4 pixels at a time when I could have easily done 16?  The answer is bad data.  When loading in data, the pixels could come from a diagonal ray meaning that there is a bit of overhead in loading and masking and shuffling.  
Depending on your specific problem, there are clever ways around this and the entire algorithm can be dramatically sped up, but I will have to talk about this in a future post.</description>
      <enclosure url="http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/17_n00b_tip__Psychic_computing_files/backtothefuturefiretrails.jpg" length="21560" type="image/jpeg"/>
    </item>
    <item>
      <title>Works better than the real thing...</title>
      <link>http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/17_Day_of_longboarding.html</link>
      <guid isPermaLink="false">7e2764bc-e59f-42f0-b0f2-19abdb761cd5</guid>
      <pubDate>Sat, 17 Apr 2010 13:58:40 +0900</pubDate>
      <description>My wife works all day Saturday, which leaves me some time to work on non-work related projects.  Since the new firmware “upgrade” forced me to buy a second PS3 for gaming, I thought I’d hack together a little something to make Linux dev easier on the old PS3.  It’s fairly basic and limited, but it allows me to compile and run stuff on the PS3 without having a keyboard or mouse attached.  I can even map the build and run to Xcode buttons.  Once again, nothing impressive but it should make iteration a little easier.  Now if only I had made “kill elf” and “execute elf” opposite commands...</description>
    </item>
  </channel>
</rss>
