I'm trying to use Kahan summation to add up a large number of values inside a compute shader. It's useful for large matrix multiplies, force interactions, and similar reductions: you don't need a double for accumulation, because the low-order bits lost in each addition are kept in a temporary and added back in on the next iteration:
http://en.wikipedia.org/wiki/Kahan_summation
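For reference, here is a minimal CPU-side sketch of the same idea in C. The names naive_sum and kahan_sum are made up for illustration, and the volatile qualifiers are an assumption about how to keep a C compiler from doing what the shader compiler does (simplifying the compensation away or keeping intermediates in wider registers). It is a sketch for checking results, not the shader code itself.

```c
#include <stddef.h>

/* Plain single-precision accumulation, for comparison. */
float naive_sum(const float *values, size_t count) {
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}

/* Kahan (compensated) summation in single precision. */
float kahan_sum(const float *values, size_t count) {
    float sum = 0.0f;
    float compensation = 0.0f; /* running estimate of the rounding error */
    for (size_t i = 0; i < count; ++i) {
        /* Assumption: volatile forces each intermediate to be rounded to
           float and stops the compiler from folding (t - sum) - y to 0. */
        volatile float y = values[i] - compensation;
        volatile float t = sum + y;
        compensation = (t - sum) - y; /* what was added, minus what was meant */
        sum = t;
    }
    return sum;
}
```

On a typical IEEE 754 single-precision target, summing 1.0f followed by 100,000 copies of 1e-8f leaves naive_sum stuck at about 1.0, because each addend is below half an ulp of 1.0, while kahan_sum comes out close to 1.001.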
Unfortunately, the HLSL compiler (fxc) optimizes the compensation step away entirely:
void Accumulate(inout float4 sum, float4 source, inout float4 compensation) {
    float4 y = source - compensation;
    float4 t = sum + y;
    compensation = (t - sum) - y;
    sum = t;
}

Buffer<float4> input;
RWBuffer<float4> output;

[numthreads(1, 1, 1)]
void AddThemUp(uint3 groupID : SV_GroupID) {
    float4 sum = float4(0, 0, 0, 0);
#ifdef ACCUMULATE_KAHAN
    float4 compensation = float4(0, 0, 0, 0);
#endif
    [loop]
    for(uint i = groupID.x; i < 16384; ++i) {
        float4 add = input[i];
#ifdef ACCUMULATE_KAHAN
        Accumulate(sum, add, compensation);
#else
        sum += add;
#endif
    }
    output[groupID.x] = sum;
}
Unless I compile with /Od, the same assembly is generated with and without ACCUMULATE_KAHAN defined:
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_resource_buffer (float,float,float,float) t0
dcl_uav_typed_buffer (float,float,float,float) u0
dcl_input vThreadGroupID.x
dcl_temps 3
dcl_thread_group 1, 1, 1
mov r0.xyzw, l(0,0,0,0)
mov r1.x, vThreadGroupID.x
loop
  uge r1.y, r1.x, l(0x00004000)
  breakc_nz r1.y
  ld_indexable(buffer)(float,float,float,float) r2.xyzw, r1.xxxx, t0.xyzw
  add r0.xyzw, r0.xyzw, r2.xyzw
  iadd r1.x, r1.x, l(1)
endloop
store_uav_typed u0.xyzw, vThreadGroupID.xxxx, r0.xyzw
ret
// Approximately 11 instruction slots used
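One plausible reading of why the two listings match (an interpretation, not documented fxc behavior): if the optimizer treats float arithmetic as exact real arithmetic, the compensation term is identically zero. Writing s for sum, x for source, and c for compensation, with c' the updated compensation:

```latex
\begin{aligned}
y  &= x - c \\
t  &= s + y \\
c' &= (t - s) - y = \bigl((s + y) - s\bigr) - y = 0
\end{aligned}
```

Since c starts at zero and stays zero, y reduces to x and the update becomes s + x, which is exactly the single add in the listing. In real floating point, t is the rounded value of s + y, so c' is the rounding error of that addition and is generally nonzero.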
I tried using the precise qualifier, but it appears that even a single use of it anywhere infects the entire shader, eliminating all MADs and replacing them with separate MULs and ADDs. It doesn't simply turn off algebraic optimization. When I mark the float4 t temporary inside Accumulate as precise, fxc generates this:
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_resource_buffer (float,float,float,float) t0
dcl_uav_typed_buffer (float,float,float,float) u0
dcl_input vThreadGroupID.x
dcl_temps 6
dcl_thread_group 1, 1, 1
mov [precise] r0.xyzw, l(0,0,0,0)
mov [precise] r1.xyzw, l(0,0,0,0)
mov [precise(x)] r2.x, vThreadGroupID.x
loop
  uge [precise(y)] r2.y, r2.x, l(0x00004000)
  breakc_nz r2.y
  ld_indexable [precise](buffer)(float,float,float,float) r3.xyzw, r2.xxxx, t0.xyzw
  add [precise] r3.xyzw, -r1.xyzw, r3.xyzw
  add [precise] r4.xyzw, r0.xyzw, r3.xyzw
  add [precise] r5.xyzw, -r0.xyzw, r4.xyzw
  add [precise] r1.xyzw, -r3.xyzw, r5.xyzw
  iadd [precise(x)] r2.x, r2.x, l(1)
  mov [precise] r0.xyzw, r4.xyzw
endloop
store_uav_typed u0.xyzw, vThreadGroupID.xxxx, r0.xyzw
ret
// Approximately 16 instruction slots used
Even the loop iterator is marked [precise]. This really doesn't agree with the description of the qualifier in the documentation (the .chm). In my actual code (a force accumulation shader), using precise only inside the Accumulate function grows the instruction slot count from 994 to 1326, because every fused MAD is split into a separate MUL and ADD.
It would be really useful if there were a function modifier, say [fullalgebra], that turned off algebraic optimizations just for the scope of one function, without using precise or removing MADs.
Thanks,
.sean