-
|
|
Synchronization methods in Compute Shader
|
Hi,
Can anyone please explain the details of, and the differences among, the following synchronization methods available in Compute Shader in DirectX 11?
1. AllMemoryBarrier
2. AllMemoryBarrierWithGroupSync
3. DeviceMemoryBarrier
4. DeviceMemoryBarrierWithGroupSync
5. GroupMemoryBarrier
6. GroupMemoryBarrierWithGroupSync
Thanks & Regards, Nirav Shah
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
The primary differences are what the compiler/driver can't do in each scenario.
1) AllMemoryBarrier: no memory accesses (UAV or group shared) may be moved across this barrier. Essentially, until you hit the barrier, there is no guarantee that what you wrote to memory is visible to other threads.
2) AllMemoryBarrierWithGroupSync: similar to 1, but this is a threading barrier as well. All threads in the group must hit this statement before any of them can continue.
3) DeviceMemoryBarrier: no UAV accesses may be moved across this barrier.
4) DeviceMemoryBarrierWithGroupSync: 3 + threading barrier.
5) GroupMemoryBarrier: no accesses to group shared memory may be moved across this barrier.
6) GroupMemoryBarrierWithGroupSync: 5 + threading barrier.
1, 3, and 5 really aren't all that interesting by themselves, because without a thread sync barrier, there is no guarantee that whoever wrote the data you'd like to see will have written it yet. So these are mostly to help you work around compiler/driver bugs that don't recognize a dependency and are ordering your memory accesses incorrectly.
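As a rough illustration of why the "WithGroupSync" variants are the ones usually needed, here is a minimal sketch in which each thread reads a value written by a different thread in its group. The buffer names, register slots, group size, and entry point name are assumptions made up for this example; only the barrier intrinsic comes from the list above.

```hlsl
// Hypothetical resources, chosen only for illustration.
StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

#define GROUP_SIZE 64
groupshared float sTile[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // Each thread publishes one value to group shared memory.
    sTile[gi] = gInput[dtid.x];

    // Memory barrier + thread barrier: every thread in the group has
    // reached this point, and its shared-memory write is visible,
    // before any thread continues.
    GroupMemoryBarrierWithGroupSync();

    // Now it is safe to read a slot written by a different thread.
    uint neighbor = (gi + 1) % GROUP_SIZE;
    gOutput[dtid.x] = sTile[gi] + sTile[neighbor];
}
```

With a plain GroupMemoryBarrier() in place of the synchronizing version, nothing would guarantee that the neighboring thread had executed its write yet, which is the point made in the paragraph above.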
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
Thanks for the clarification, John. I still have some doubts, though.
John Rapp:1) no memory accesses may be moved across this barrier, essentially this means that until you hit the barrier, there is no guarantee that what you wrote to memory will be visible to other threads
You mean that until execution reaches this call, there is no guarantee that whatever has been written to shared memory is visible to other threads? Then what happens if I don't use this call at all? Throughout the execution, I would never be sure whether a thread reading from shared memory sees the proper value. Does this mean I need to put this call after each and every write to shared memory?
John Rapp:3) no UAV accesses may be moved across this barrier
First, does device memory mean UAV? My understanding was that UAVs (Unordered Access Views) are just a mechanism for accessing some memory. And from the name of the method, it seemed that this function provides synchronization over "device" memory, as in video memory or graphics card memory rather than RAM/main memory. Is there any way to specify, when allocating memory, whether it is allocated in main memory or graphics memory?
Thanks & Regards, Nirav.
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
Yes, but you have to remember that it goes both ways. There is no guarantee that anything you wrote to shared memory will be visible to other threads, and there is no guarantee that you'll be able to see anything another thread wrote or has yet to write.
If you don't use these calls at all, then you're specifying that all of your threads are independent of each other and don't need to share data. This doesn't mean you need to put them after every write to shared memory, though. Generally the programs we see have various phases, where threads act individually, then synchronize with each other, and then continue individually again.
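The "phases" pattern described above might look something like the following per-group sum, sketched under assumptions: the buffer names, register slots, group size, and entry point are hypothetical, and this is only one common way to structure such a reduction.

```hlsl
// Hypothetical resources, chosen only for illustration.
StructuredBuffer<float>   gValues    : register(t0);
RWStructuredBuffer<float> gGroupSums : register(u0);

#define GROUP_SIZE 128
groupshared float sPartial[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSReduce(uint3 dtid : SV_DispatchThreadID,
              uint3 gid  : SV_GroupID,
              uint  gi   : SV_GroupIndex)
{
    // Phase 1: threads act individually, one load per thread.
    sPartial[gi] = gValues[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Phase 2: tree reduction. Threads synchronize once per step,
    // not after every individual write.
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            sPartial[gi] += sPartial[gi + stride];

        // Every thread reaches this barrier, including the idle ones.
        GroupMemoryBarrierWithGroupSync();
    }

    // Phase 3: a single thread writes the group's result to the UAV.
    if (gi == 0)
        gGroupSums[gid.x] = sPartial[0];
}
```

Note that the barrier sits outside the `if`, so all threads in the group execute it the same number of times.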
The only device memory that can be both read and written (required for these synchronization primitives to be necessary) is memory in a UAV. I don't believe we have any mechanism for the GPU to explicitly use a system memory surface as a writeable surface, though in certain cases, system memory may be treated as device memory (such as in some integrated parts).
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
Thanks a lot, John, for all the clarifications. :)
I would like to go deeper into the synchronization mechanisms used by compute shader. When accessing shared memory (read or write), what is the granularity? I mean, when I read or write one byte of shared memory, is the whole shared memory locked for that access, or only the byte I am accessing? Or is a block of bytes locked? If so, what is the size of the block?
Moreover, are there concepts like shared locks for reads and exclusive locks for writes as in DBMSes?
Thanks & Regards, Nirav Shah.
PS: If I am bothering too much then you can also give me a good pointer which explains synchronization mechanisms used by compute shader and DirectX 11. :)
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
The granularity is 4 bytes. HLSL doesn't currently directly expose anything smaller than this (though you can obviously use bitwise operations to insert/extract a byte from a 4-byte block).
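For reference, the bitwise insert/extract mentioned here could be sketched as below. The helper names are hypothetical, and byte 0 is assumed to be the least significant byte of the 4-byte word.

```hlsl
// Hypothetical helpers; byteIndex is expected to be in the range 0..3.

// Returns byte 'byteIndex' of 'word' in the low 8 bits of the result.
uint ExtractByte(uint word, uint byteIndex)
{
    return (word >> (byteIndex * 8)) & 0xFF;
}

// Returns 'word' with byte 'byteIndex' replaced by the low 8 bits of 'value'.
uint InsertByte(uint word, uint byteIndex, uint value)
{
    uint shift = byteIndex * 8;
    uint mask  = 0xFFu << shift;
    return (word & ~mask) | ((value & 0xFF) << shift);
}
```

Keep in mind that InsertByte is a read-modify-write of the whole 4-byte word. If two threads update different bytes of the same word at the same time, one update can be lost, so in that situation the interlocked intrinsics (for example InterlockedAnd followed by InterlockedOr) would be the safer route.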
As for your second question, there are no lock concepts. Locks are a very bad model when you start to have many hundreds of threads trying to access the lock simultaneously (1000x slowdown is not unlikely). Instead, your algorithm should organize the threads in such a way that they can do their writes without stepping on each other, or make use of the interlocked intrinsics to calculate a spot that will not step on any other threads (for example, incrementing an integer that reserves a unique spot for your data).
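A minimal sketch of the "reserve a unique spot" idea, assuming a hypothetical filter that keeps only positive values. The buffer names and register slots are made up for the example, and the application is assumed to have cleared gCounter[0] to zero before the dispatch.

```hlsl
// Hypothetical resources, chosen only for illustration.
StructuredBuffer<float>   gInput   : register(t0);
RWStructuredBuffer<float> gOutput  : register(u0);
RWStructuredBuffer<uint>  gCounter : register(u1); // gCounter[0] assumed cleared to 0

[numthreads(64, 1, 1)]
void CSCompact(uint3 dtid : SV_DispatchThreadID)
{
    float v = gInput[dtid.x];

    if (v > 0.0f)
    {
        // Atomically increment the counter. 'slot' receives the value
        // from before the increment, so no two threads get the same index.
        uint slot;
        InterlockedAdd(gCounter[0], 1, slot);

        // This thread now owns gOutput[slot]; no lock is required.
        gOutput[slot] = v;
    }
}
```

The order in which threads land in gOutput is not deterministic, which is usually the price of avoiding locks with this approach.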
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
Okay, so there is no locking done by DirectX itself. The programmer should be aware of this fact and write the code accordingly, got it.
Alright, then is there any mechanism to control access to critical sections (code where shared reads and writes happen)? I haven't come across any intrinsic functions like "test and set" or semaphores that would allow this. There might be cases where one thread is writing to a location while another is reading from it. Giving each thread exclusive access to its own memory locations may result in a lot more memory consumption and copy operations.
Thanks & Regards, Nirav.
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
We considered adding language support for critical sections, but again you can run into the 1000x slowdown case, so we decided against it. If it really is something that you only want one thread to do, then we suggest using the if( threadid == N ) trick.
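As a hedged illustration of the if( threadid == N ) trick, the sketch below lets thread 0 of each group do the work that must happen exactly once (initializing and later publishing a shared counter), with barriers separating it from the work all threads do. All names, register slots, and the counting task itself are assumptions for the example.

```hlsl
// Hypothetical resources, chosen only for illustration.
StructuredBuffer<float>  gInput       : register(t0);
RWStructuredBuffer<uint> gGroupCounts : register(u0);

groupshared uint sCount;

[numthreads(64, 1, 1)]
void CSCount(uint3 dtid : SV_DispatchThreadID,
             uint3 gid  : SV_GroupID,
             uint  gi   : SV_GroupIndex)
{
    // Only one thread per group initializes the shared counter.
    if (gi == 0)
        sCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // All threads contribute through an interlocked add, not a lock.
    if (gInput[dtid.x] > 0.0f)
        InterlockedAdd(sCount, 1);
    GroupMemoryBarrierWithGroupSync();

    // Only one thread per group publishes the result.
    if (gi == 0)
        gGroupCounts[gid.x] = sCount;
}
```

The barriers are placed outside the `if` blocks so that every thread in the group reaches them.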
|
|
-
|
|
Re: Synchronization methods in Compute Shader
|
Okay. I'll do that.
Thanks a ton for all this information... :)
Regards, Nirav.
|
|
|