CUDA __syncthreads() usage within a warp - parallel-processing

If it was absolutely required for all the threads in a block to be at the same point in the code, do we require the __syncthreads function if the number of threads being launched is equal to the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
shared _voltatile_ sdata[16];
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];

Updated with more information about using volatile
Presumably you want all threads to be at the same point since they are reading data written by other threads into shared memory, if you are launching a single warp (in each block) then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
Remember that a compiler will assume that it can optimise providing the intra-thread semantics remain correct, including delaying stores to memory where the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read the data. Using volatile causes the compiler to perform the memory write rather than keep in registers, however this has some risks and is more of a hack (meaning I don't know how this will be affected in the future)
Technically, you should always use __syncthreads() to conform with the CUDA Programming Model
The warp size is and always has been 32, but you can:
At compile time use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version)
At run time use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual)
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.

You still need __syncthreads() even if warps are being executed in parallel. The actual execution in hardware may not be parallel because the number of cores within a SM (Stream Multiprocessor) can be less than 32. For example, GT200 architecture has 8 cores in each SM, so you can never be sure all threads are in the same point in the code.

Related

Why cannot a kernel be launched with the reason of too many register use when there is a register spilling mechanism?

1) When does a kernel start to spill registers to local memory?
2) When there is not enough registers, how does the CUDA runtime decide to not launch a kernel and throws too many resources requested error? How many registers are enough to launch a kernel?
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
1) When does a kernel start to spill registers to local memory?
This is entirely under control of the compiler. It is not performed by the runtime, and there are no dynamic runtime decisions about it. When your code reaches the point of a spill, it means that the compiler has inserted an instruction like:
STL [R0], R1
In this case, R1 is being stored to local memory, the local memory address given in R0. This would be a spill store. (After that instruction, R1 could be used for/loaded with something else.) The compiler knows when it has done this, of course, and so it can report the number of spill loads and spill stores it has chosen to use/make. You can get this information (along with register usage, and other information) using the -Xptxas=-v compiler switch.
The compiler (unless you restrict it, see below) makes decisions about register usage primarily focused on performance, paying otherwise less attention to how many registers are actually used. The first priority is performance.
2) When there is not enough registers, how does the CUDA runtime decide to not launch a kernel and throws too many resources requested error? How many registers are enough to launch a kernel?
At compile-time, when your kernel code is being compiled, the compiler has no idea how it will be launched. It has no idea what your launch configuration will be like (number of blocks, number of threads per block, amount of dynamically allocated shared memory, etc) In fact the compilation process mostly proceeds as if the thing being compiled is a single thread.
During compilation, the compiler makes a bunch of static decisions about register assignments (how and where registers will be used). CUDA has binary utilities that can help with understanding this. Register assignments don't change at runtime, are not in any way dynamic, and therefore are entirely determined at compile time. Therefore, at the completion of compilation for a given device code function, it is generally possible to determine how many registers are needed. The compiler includes this information in the binary compiled object.
At runtime, at the point of kernel launch, the CUDA runtime now knows:
How many registers (per thread) are needed for a given kernel
What device we are running on, and therefore what the aggregate limits are
What the launch configuration is (blocks, threads)
Assembling these 3 pieces of information means the runtime can immediately know if there is or will be enough "register space" for the launch. Roughly speaking, the pass/fail arithmetic is if the launch would satisfy this inequality:
registers_per_thread*threads_per_block <= max_registers_per_multiprocessor
There is granularity to be considered in this equation as well. Registers are often allocated in groups of 2 or 4 at runtime, i.e. the registers_per_thread quantity may need to be rounded up to the next whole-number multiple of something like 2 or 4, before the inequality test is applied. The registers_per_thread quantity is ascertained by the compiler as already described. The threads_per_block quantity comes from your kernel launch configuration. The max_registers_per_multiprocessor quantity is machine-readable (i.e. it is a function of the GPU you are running on). You can see how to retrieve that quantity yourself if you wish by studying the deviceQuery CUDA sample code.
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
I reiterate that the register assignment (and register spill decisions) is/are entirely a static compile-time process. No runtime decisions or alterations are made. The register assignment is entirely inspectable from the compiled code. Therefore, since no adjustments can be made at runtime, no changes could be made to allow an arbitrary launch. Any such change would require recompilation of the code. While this might be theoretically possible, it is not currently implemented in CUDA. Furthermore, it has the possibility to lead to both variable and perhaps unpredictable behavior (in performance) so there might be reasons not to do it.
Its possible to make all kernels "launchable" (with respect to register limitations) by suitably restricting the compiler's choices about register assignment. __launch_bounds__ and the compiler switch -maxrregcount are a couple ways to achieve this. CUDA provides both an occupancy calculator as well as an occupancy API to help with this process.

Why we do not use barriers in User space

I am reading about memory barriers and what I can summarize is that they prevent instruction re-ordering done by compilers.
So in User space memory lets say I have
b = 0;
main(){
a = 10;
b = 20;
c = add(a,b);
}
Can the compiler reorder this code so that b = 20 assignment happens after c = add() is called.
Why we do not use barriers in this case ? Am I missing some fundamental here.
Does Virtual memory is exempted from any re ordering ?
Extending the Question further:
In Network driver:
1742 /*
1743 * Writing to TxStatus triggers a DMA transfer of the data
1744 * copied to tp->tx_buf[entry] above. Use a memory barrier
1745 * to make sure that the device sees the updated data.
1746 */
1747 wmb();
1748 RTL_W32_F (TxStatus0 + (entry * sizeof (u32)),
1749 tp->tx_flag | max(len, (unsigned int)ETH_ZLEN));
1750
When he says devices see the updated data... How to relate this with the multi threaded theory for usage of barriers.
Short answer
Memory barriers are used less frequently in user mode code than kernel mode code because user mode code tends to use higher level abstractions (for example pthread synchronization operations).
Additional details
There are two things to consider when analyzing the possible ordering of operations:
What order the thread that is executing the code will see the operations in
What order other threads will see the operations in
In your example the compiler cannot reorder b=20 to occur after c=add(a,b) because the c=add(a,b) operation uses the results of b=20. However, it may be possible for the compiler to reorder these operations so that other threads see the memory location associated with c change before the memory location associated with b changes.
Whether this would actually happen or not depends on the memory consistency model that is implemented by the hardware.
As for when the compiler might do reordering you could imagine adding another variable as follows:
b = 0;
main(){
a = 10;
b = 20;
d = 30;
c = add(a,b);
}
In this case the compiler would be free to move the d=30 assignment to occur after c=add(a,b).
However, this entire example is too simplistic. The program doesn't do anything and the compiler can eliminate all the operations and does not need to write anything to memory.
Addendum: Memory reordering example
In a multiprocessor environment multiple threads can see memory operations occur in different orders. The Intel Software Developer's Manual has some examples in Volume 3 section 8.2.3. I've copied a screenshot below that shows an example where loads and stores can be reordered.
There is also a good blog post that provides some more detail about this example.
The thread running the code will always act as if the effects of the source lines of its own code happened in program order. This is as if rule is what enables most compiler optimizations.
Within a single thread, out-of-order CPUs track dependencies to give a thread the illusion that all its instructions executed in program order. The globally-visible (to threads on other cores) effects may be seen out-of-order by other cores, though.
Memory barriers (as part of locking, or on their own) are only needed in code that interacts with other threads through shared memory.
Compilers can similarly do any reordering / hoisting they want, as long as the results are the same. The C++ memory model is very weak, so compile-time reordering is possible even when targeting an x86 CPU. (But of course not reordering that produces different results within the local thread.) C11 <stdatomic.h> and the equivalent C++11 std::atomic are the best way to tell the compiler about any ordering requirements you have for the global visibility of operations. On x86, this usually just results in putting store instructions in source order, but the default memory_order_seq_cst needs an MFENCE on each store to prevent StoreLoad reordering for full sequential consistency.
In kernel code, memory barriers are also common to make sure that stores to memory-mapped I/O registers happen in a required order. The reasoning is the same: to order the globally-visible effects on memory of a sequence of stores and loads. The difference is that the observer is an I/O device, not a thread on another CPU. The fact that cores interact with each other through a cache coherency protocol is irrelevant.
The compiler cannot reorder (nor can the runtime or the cpu) so that b=20 is after the c=add()since that would change the semantics of the method and that is not permissible.
I would say that for the compiler (or runtime or cpu) to act as you describe would make the behaviour random, which would be a bad thing.
This restriction on reordering applies only within the thread executing the code. As #GabrielSouthern points out, the ordering of the stores becoming globally visible is not guaranteed, if a, b, and c are all global variables.

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half variable that's not updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum, a processor usually provide extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Having a need for it be atomic is very common, most any C++ program relies on it for example to implement reference counting.
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahls' law behavior. In other words, adding an extra thread does not actually make the program faster. Sometimes actually making it slower. Whomever finds a better mouse-trap for this is guaranteed a Nobel, we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does since it now includes threads as part of the standard, but tradiationally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterchangeExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

How to ensure that my workitems are running parallel?

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing in the kernel is not allowed. Hence, how to ensure that all my 8 cores are running in parallel?
Extra info (regarding my question): for kernel, i am passing input and and output array of 8X8 size as a buffer. According to workitem number, i am solving that row and saving the result in output buffer. and after that i am reading the result.
If i am running AMD platform SDK, where i add print statement in kernel by
#pragma OPENCL EXTENSION cl_amd_printf : enable
hence i can see clearly, if i am using 4 core machine, my first 4 cores are running parallel and then rest will run in parallel, which shows it is solving maximum 4 in parallel.
But, how can i see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't then this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
If you are enqueueing enough work items (at least 32 but ideally thousands) then your "workitems are running parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.

How does my OS know in which direction to flush if I use OpenMP flush function?

A thread that reads a shared variable has first to call flush, and a thread that writes to a shared variable has to call OpenMP flush afterwards, to keep the shared variable in main memory and cache synchronized. How does the flush function know in which direction to flush? It needs to know which of both variables (main memory or cache) is newer. I assume, but I am not sure, that the OS or CPU take care of this somehow. Does someone know?
flush is not a function - it is an OpenMP compiler directive. It affects the way the compiler generates the executable code and instructs it to synchronise the values of all optimised variables (stored in CPU registers or other explicitly programmable cache / thread-local memory) in the flush-set. This is similar to the effect that the volatile storage modifier has on code generation, but has more limited point-local effect.
How does it work? While parsing the source code, the compiler analyses the flow of statements and the data (variables) that gets affected by those statements. Consequently the compiler builds an execution graph and a data dependency graph from the code. It knows exactly where and how the value of each variable is being used and the execution of which code block affects which variables. Then the compiler tries to optimise the code by simplifying the graph and to reduce the number of expensive memory operations by either using CPU registers to store intermediate values or by using another for of faster thread-addressable local memory. The flush directive adds special points in the execution graph, where the compiler must explicitly synchronise the memory view of the thread (register variables and local-memory variables) with the global shared memory. Since the compiler has built the dependency graph in the first place, it knows exactly which variables in the flush-set were modified and hence have to be written to the shared memory; all other variables in the flush-set have to be read from the shared memory.
So the answer to your question is that it is usually the compiler who processes the flush directive, not the OS, although the compiler might call into the OS to actually implement the flush, e.g. on systems with explicitly programmable caches/local memories. But one should also note that OpenMP is an abstract standard, which can be implemented on many different hardware platforms and that some of those platforms provide certain hardware that can help with implementing the OpenMP abstractions more efficiently (e.g. the CPU ASIC in IBM's Blue Gene/Q provides many such features).
You don't need to call flush to keep shared variables synchronized.
The hardware (CPU) does keep track of cached memory and if there are conflicting accesses, they will slow down your program, because the cache will be flushed by CPU.
I understand the flush directive more like a conditional barrier.
A flush containing the same variable must be encountered by at least two threads to have an effect.
When this directive is met by two threads with say variable a in common, if they have modified it they will write back their modifications to memory (as opposed to keep it in a local variable or register), and then I suppose there is a barrier for both thread to get to that point before they continue.
If the variable a is used after the flush it is reread from memory.

Resources