XNA Creators Club Online

Compute Shader samples

Last post 21/10/2009 21:17 by BarnacleJunior. 2 replies.
  • 19/10/2009 20:18

    Compute Shader samples

    Is there a schedule for the next SDK release, or even a minor release with just a bug-fixed fxc?

    More importantly, when can we expect useful Compute Shader samples? I've spent all weekend trying to write a radix sort for cs_5_0, mostly going off the Garland and Harris papers, and drawing more and more on the CUDPP library. But it's a lot of intricate code in the sometimes-hard-to-follow CUDA idiom. Since discussions of bank conflicts and the like make up a big part of the CUDA literature, can we hope for some documentation that has at least rules of thumb for dealing with the R800 and GT300 architectures specifically under D3D11?

    .sean

  • 21/10/2009 6:37

    Re: Compute Shader samples

    Hi Sean,
    What sort of Compute Shader samples did you want to see? Currently the August 2009 SDK has these Compute samples:

    1) Array A + Array B
    2) BC6H BC7 encoding/decoding using CS 4.0
    3) Bitonic sort using CS 4.0
    4) NBody systems
    5) Order Independent Transparency
    6) Adaptive Tessellation using CS 4.0
    7) HDR Tone Mapping

    As for a bug-fixed fxc, are you hitting a blocking issue? What is the bug?

    Thanks!
    -Sebastian
  • 21/10/2009 21:17

    Re: Compute Shader samples

    As for samples, I have in mind some more powerful things, like the radix sort in CUDPP. Bitonic sort really isn't appropriate for very large arrays. I think the compute shader samples in the Aug SDK are good enough to learn the mechanics of compute shader HLSL and bindings to the API, but they don't do anything to actually teach GPU parallel programming. The CS_5_0 profile will enable much better samples, I'm sure, and I expect that lots of strip-mining kinds of operations will show up (Volkov-style large matrix mul, etc., which are pretty easy as you are just vectorizing well-known routines), but things like radix sort, merge sort, FFT, &c., are pretty difficult to figure out on one's own.
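
    To make the radix sort idea concrete, here is a minimal serial sketch in Python of the scan-based "split" formulation: one stable partition per key bit, with each element's destination computed from an exclusive prefix sum. This is the textbook one-bit variant, not the CUDPP implementation, and the names `exclusive_scan`, `split_by_bit`, and `radix_sort` are hypothetical. On a GPU, the scatter loop would be one thread per element and the scan would be a parallel primitive.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i])."""
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def split_by_bit(keys, bit):
    """Stable partition: keys with `bit` clear come first, then keys with it set."""
    n = len(keys)
    if n == 0:
        return []
    is_zero = [1 - ((k >> bit) & 1) for k in keys]
    zero_dst = exclusive_scan(is_zero)        # slot of each zero-bit key
    total_zeros = zero_dst[-1] + is_zero[-1]
    out = [None] * n
    # Each iteration is independent, so this maps to one GPU thread per element.
    for i, key in enumerate(keys):
        if is_zero[i]:
            dst = zero_dst[i]
        else:
            # i - zero_dst[i] is the number of one-bit keys before position i.
            dst = total_zeros + (i - zero_dst[i])
        out[dst] = key
    return out


def radix_sort(keys, key_bits=32):
    """LSD radix sort of non-negative integers below 2**key_bits."""
    for bit in range(key_bits):
        keys = split_by_bit(keys, bit)
    return keys
```

    Because every split is stable, processing bits from least to most significant leaves the array fully sorted. Practical GPU versions usually handle several bits per pass to cut the number of passes.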

    The Order-Independent-Transparency demo does have a prefix sum (which you need for sorting as well as for stream compaction), but I took a cursory glance and it looks like it is highly sequential and has horrible work efficiency. On the last pass, it appears you have a Dispatch with a single group with a single thread that loops over half the pixels of the entire framebuffer. I've been puzzling over the Mark Harris/Mike Garland radix sort that leverages a fast parallel prefix sum, but the code is of course in CUDA-ese and very hard to understand. Getting this working for Compute Shaders would make a useful demo, as you can take a million or more particles, quantize them into buckets, interleave the x, y, z coordinates for z-order indexing, and radix sort to put the particles in an octree, each frame, and entirely on the GPU.
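
    As a rough illustration of the two building blocks mentioned here, the following serial Python sketch shows a work-efficient exclusive scan in the up-sweep/down-sweep style (O(n) additions in O(log n) steps) and a z-order key built by quantizing and bit-interleaving x, y, z. It is a sketch under assumptions, not code from the SDK or from CUDPP: the function names, the 10 bits per axis, and the bounding-box parameters `lo` and `hi` are all hypothetical.

```python
def blelloch_exclusive_scan(values):
    """Work-efficient exclusive scan: an up-sweep (reduce) then a down-sweep."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)    # pad to a power of two

    # Up-sweep: build partial sums in place. The inner loop iterations are
    # independent, so each stride level is one parallel step on a GPU.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefix sums back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]


def quantize(coord, lo, hi, bits=10):
    """Map coord in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    t = (coord - lo) / (hi - lo)
    return min(cells - 1, max(0, int(t * cells)))


def interleave3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one z-order (Morton) key."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def morton_key(point, lo, hi, bits=10):
    """Z-order key for a 3D point inside the axis-aligned box [lo, hi]^3."""
    x, y, z = (quantize(c, lo, hi, bits) for c in point)
    return interleave3(x, y, z, bits)
```

    Sorting particles by `morton_key` groups points that share high-order key bits, which is what lets the sorted array be read as an implicit octree. With 10 bits per axis the key fits in 30 bits, so a 32-bit radix sort is enough.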

    For bugs, I've reported two and seen others on this forum: double precision inside dynamic loops always crashes the compiler, texture sampling instructions are dropped entirely when their results are cached into groupshared memory, and indexing arrays inside structured buffers requires literals (this is not broken for UAVs).

    I would like to say that the D3D11 team is doing a pretty good job.  This is exciting tech.  And as inchoate as the SDK is, it sure beats what's available for another super-hyped API (which still has no Windows ICD!). 

    .sean
