XNA Creators Club Online

compensated summation and precise qualifier

Last post 22/10/2009 18:51 by John Rapp. 1 reply.
  • 22/10/2009 17:05

    compensated summation and precise qualifier

    I'm trying to use Kahan summation to add up a large set of numbers inside a compute shader. It's useful for large matrix multiplies, force interactions, and similar accumulations. You don't need a double for the accumulator, because the low-order bits lost in each addition are kept in a temporary and added back in on the next step:
    http://en.wikipedia.org/wiki/Kahan_summation

    Unfortunately, the HLSL compiler optimizes the compensation step out entirely:

    void Accumulate(inout float4 sum, float4 source, inout float4 compensation) {
        float4 y = source - compensation;
        float4 t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }

    Buffer<float4> input;
    RWBuffer<float4> output;

    [numthreads(1, 1, 1)]
    void AddThemUp(uint3 groupID : SV_GroupID) {
        float4 sum = float4(0, 0, 0, 0);
    #ifdef ACCUMULATE_KAHAN
        float4 compensation = float4(0, 0, 0, 0);
    #endif
        [loop]
        for(uint i = groupID.x; i < 16384; ++i) {
            float4 add = input[i];
    #ifdef ACCUMULATE_KAHAN
            Accumulate(sum, add, compensation);
    #else
            sum += add;
    #endif    
        }
        output[groupID.x] = sum;
    }


    Unless I compile with /Od, fxc generates the same assembly with and without ACCUMULATE_KAHAN defined:

    cs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_resource_buffer (float,float,float,float) t0
    dcl_uav_typed_buffer (float,float,float,float) u0
    dcl_input vThreadGroupID.x
    dcl_temps 3
    dcl_thread_group 1, 1, 1
    mov r0.xyzw, l(0,0,0,0)
    mov r1.x, vThreadGroupID.x
    loop
      uge r1.y, r1.x, l(0x00004000)
      breakc_nz r1.y
      ld_indexable(buffer)(float,float,float,float) r2.xyzw, r1.xxxx, t0.xyzw
      add r0.xyzw, r0.xyzw, r2.xyzw
      iadd r1.x, r1.x, l(1)
    endloop
    store_uav_typed u0.xyzw, vThreadGroupID.xxxx, r0.xyzw
    ret
    // Approximately 11 instruction slots used
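
    A plausible reading of this listing is that the optimizer treats the floats as real numbers: substituting t = sum + y into (t - sum) - y gives zero, so compensation folds away and only the plain add survives. In floating point that expression is the rounding error of the add, not zero. The following is a minimal CPU-side sketch in C, not part of the shader, that mirrors the steps of Accumulate on scalars. The names sum_naive and sum_kahan are made up for illustration, and volatile is used only to force each intermediate to be stored as a float so the steps are evaluated as written.

```c
#include <stddef.h>

/* Plain float accumulation: once the running sum is large, the
   low-order bits of each addend are rounded away. The volatile
   qualifier forces the sum to be stored as a float on every step. */
float sum_naive(const float *x, size_t n) {
    volatile float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum = sum + x[i];
    return sum;
}

/* Kahan (compensated) summation, following the same four steps as
   Accumulate() in the shader, on scalars instead of float4. */
float sum_kahan(const float *x, size_t n) {
    float sum = 0.0f;
    float compensation = 0.0f;            /* rounding error carried over */
    for (size_t i = 0; i < n; ++i) {
        volatile float y = x[i] - compensation; /* corrected addend      */
        volatile float t = sum + y;             /* rounded new sum       */
        compensation = (t - sum) - y;           /* what the add lost     */
        sum = t;
    }
    return sum;
}
```

    Summing a long array of identical small values (0.1f repeated a million times, say) with both functions should show the naive total drifting well away from the true value while the compensated total stays close to it. That gap is what disappears when the compensation term is folded to zero.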


    I tried using the precise qualifier, but it appears that even a single use of that qualifier anywhere infects the entire shader, eliminating all MADs and replacing them with MULs and ADDs.  It doesn't simply turn off algebraic optimization.  When I use precise on the float4 t term inside the Accumulate scope, fxc generates this:

    cs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_resource_buffer (float,float,float,float) t0
    dcl_uav_typed_buffer (float,float,float,float) u0
    dcl_input vThreadGroupID.x
    dcl_temps 6
    dcl_thread_group 1, 1, 1
    mov [precise] r0.xyzw, l(0,0,0,0)
    mov [precise] r1.xyzw, l(0,0,0,0)
    mov [precise(x)] r2.x, vThreadGroupID.x
    loop
      uge [precise(y)] r2.y, r2.x, l(0x00004000)
      breakc_nz r2.y
      ld_indexable [precise](buffer)(float,float,float,float) r3.xyzw, r2.xxxx, t0.xyzw
      add [precise] r3.xyzw, -r1.xyzw, r3.xyzw
      add [precise] r4.xyzw, r0.xyzw, r3.xyzw
      add [precise] r5.xyzw, -r0.xyzw, r4.xyzw
      add [precise] r1.xyzw, -r3.xyzw, r5.xyzw
      iadd [precise(x)] r2.x, r2.x, l(1)
      mov [precise] r0.xyzw, r4.xyzw
    endloop
    store_uav_typed u0.xyzw, vThreadGroupID.xxxx, r0.xyzw
    ret
    // Approximately 16 instruction slots used

    Even the loop iterator is marked [precise].  This really doesn't agree with the description of the qualifier in the chm.  In my actual code (a force accumulation shader), using precise just inside the Accumulate function grows the instruction slot count from 994 to 1326, since all the fused MADs go to MUL, ADD.

    It would be really useful to have a function modifier, maybe [fullalgebra] or something like it, that turned off algebraic optimizations just for the scope of one function, without using precise or removing MADs.

    Thanks,

    .sean


  • 22/10/2009 18:51 In reply to

    Re: compensated summation and precise qualifier

    Marking a value precise will ensure that any values used to calculate that value are also marked precise, including any values used in the surrounding flow control. If you just want the compiler to use mad instructions on your precise values, you can use the mad intrinsic. The compiler avoids matching mad instructions for you in precise code because on some hardware the fused mad operation can have slightly higher precision, which can result in code that varies in precision depending on its surroundings.

    As for /Od optimizing out your code, I'll investigate that, but in general, precise is meant for repeatable execution of code, not for high-performance code or for selective disabling of optimization in a given block of code.