GPU programming model - how many simultaneous, divergent threads without penalty - gpgpu

I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPU's, threads get bundled into warps of 32 threads. All threads in a warp execute the same instructions so if there is divergence in branches all threads essentially take the time corresponding to taking all the incurred branches. However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost since the different warps are executed by separate computational resources. So my question is, how many concurrent warps can be so executed where divergence doesn't cause this penality... i.e. what number is it that I should look for in the spec sheet. Is it the number of "shader processors" or the number of "Streaming multiprocessors" that is relevant here?
Also, the same question for AMD Radeon: Here the relevant terms might be "unified shaders" and "compute units".
Finally, suppose I have a workload that is highly divergent across threads so I essentially just want one thread per warp. Essentially using the GPU as an ordinary multi-core CPU. Is that possible and how should I lay out the threads and thread-blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp. I realize this might not be the ideal workload for GPGPU but it would be usable for running an activity in the background without blocking the host CPU.

I am new to GPGPU and am instead learning OpenCL. But this question has remained unanswered for months, so I'll have a stab at it (and hopefully an expert will correct me if I'm wrong).
However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost since the different warps are executed by separate computational resources
Not necessarily. On AMD systems, only 64 work-items (called Threads in CUDA) are worked on at any given time (technically: each VALU in AMD systems works on 16 items at once, but any given instruction is repeated four-times, every time. So 64-items per "AMD Wavefront"). On NVidia systems, it seems like 32-threads are executed at a time per warp.
Of course, the "Block Size" is likely far larger than 64. So if you were doing 32x32 pixel blocks, you'd need 1024 cores / shaders / work items per work group (OpenCL) or Warp.
These 1024 threads CAN diverge without penalty under NVidia Pascal, because they're split into sets of 32.
So if you have a work group / warp size of 1024, correlating to 32x32 block of pixels... the first two rows will execute on one VALU (AMD GCN) or SM (NVidia Pascal). As long as ALL of those 32 threads / 64-work items take the same branches, you won't have any penalties.
Finally, suppose I have a workload that is highly divergent across threads so I essentially just want one thread per warp. Essentially using the GPU as an ordinary multi-core CPU. Is that possible and how should I lay out the threads and thread-blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp. I realize this might not be the ideal workload for GPGPU but it would be usable for running an activity in the background without blocking the host CPU.
if( threadid> 0) {
} else {
dostuff();
}
Honestly, I think its best if you just diverge and hope for the best. All of those cores have resources of their own (Registers and stuff).

Related

Why the time needed for executing code on "Hackerrank" site varies significantly?

When you try some coding challenges on www.hackerrank.com and you track the time also (e.g. in C++ using the "chrono" library), you will see every time you run the Code you will get a different time needed for execution at exactly the same Code. The variance I estimate at 10 to 30 percent. What is the reason that Code execution times are varying significantly at equal code? Which factors account to it?
It could be the Server System; but are there even physical reasons (stochastic processes in electronic components)?
Almost certainly just a busy system where your process competes for CPU time against other processes. (And competes for memory bandwidth even when it has a CPU to itself).
Also it's likely that the server runs on a CPU with SMT (Simultaneous multithreading) that uses each physical cores as two logical cores, so the "shared execution resources" that processes compete for includes stuff inside a single CPU core, not just L3 cache and memory bandwidth.
Intel calls their SMT tech "hyperthreading"; most servers these days run on Intel Xeon CPUs. AMD Zen also uses SMT, so either way unless the server admins disabled it, when the OS schedules tasks onto both logical cores of a single physical core, they will slow each other down some (with the slowdown amount mostly depending on their average IPC (instructions per cycle) - two threads that have a lot of branch mispredicts will typically not see much slowdown (so throughput nearly doubles). But two threads that could both nearly saturate the FP multiplier ALUs on their own will each run at nearly half speed (near zero gain in throughput).
OSes "know about" SMT / Hyperthreading and can detect which logical cores share a physical core. They try to avoid scheduling threads to the same physical core while there are some physical cores with both threads idle.
See also Modern Microprocessors
A 90-Minute Guide! which covers SMT, with the background to understand what execution resources are being shared.
but are there even physical reasons (stochastic processes in electronic components)?
No, CPUs are deterministic (except for the rdrand instruction which uses true analog electrical noise as a randomness source).
CPUs do use dynamic voltage and frequency scaling to save power (idle vs. max turbo, or somewhere in between if thermal limits don't allow max turbo.) https://en.wikipedia.org/wiki/Intel_Turbo_Boost / https://en.wikipedia.org/wiki/Dynamic_frequency_scaling
See also Idiomatic way of performance evaluation? for more microbenchmarking pitfalls / warm-up effects.

Why is parallel compilation performance with HT worse than without?

I've made several measurements of compilation time of wine with HyperThreading enabled and disabled in BIOS on my Core i7 930 #2.8GHz (quad-core) on Linux 2.6.39 x86_64. Each measurement was like this:
git clean -xdf
./configure --prefix=/usr
time make -j$N
where N is number from 1 to 8.
Here're the results ("speed" is 60/real from time(1)):
Here the blue line corresponds to HT disabled and purple one to HT enabled. It appears that when HT is enabled, using 1-4 threads is slower than without HT. I guess this might be related to the kernel not distributing the processes to different cores and reusing second threads of already busy cores.
So, my question: how can I force the kernel to give 1 process per core scheduling higher priority than adding more processes to the same core's different thread? Or, if my reasoning is wrong, how can I have performance with HT not worse than without HT for 1-4 processes running in parallel?
Hyper-threading on Intel chips is implemented as duplication of some of the elements of a pysical core but without enough electronics to be an independent core (e.g. they may share an instruction decoder but I cant recall the specifics of Intel's implementation).
Image a pysical core with HT as 1.5 physical cores that your OS sees as 2 real cores. This doesn't equate to 1.5x speed though (this can vary depending on use case)
In your example, non-HT is faster up to 4 threads because none of the cores are sharing work with their HT pipeline. You see a flatline above 4 threads because now you only have 4 execution threads and you get a little extra overhead context switching between threads.
In the HT example you are a bit slower up to 4 threads probably because some of those threads are being assigned to a real core and it's HT, so you are losing performance as those two execution threads share physical resources. Above 4 threads you are seeing the benefit of the extra execution threads, but you see the beginning of diminishing returns.
You could probably match performance on both cases for up to 4 threads, but likely not with a compilation job. To many processes being spawned for processor affinity to be setup I think. If you instead ran a real parallel job using OpenMP or MPI with X<=4 threads bound to the specific real CPU cores, I think you'd see similar performance between HT-off and -on.
Given a number of threads <= the number of real cores, using HT should be slower because (considered crudely) you are potentially cutting the speed of your cores in half.1
Keep in mind that generally more cores is NOT better than FASTER cores. In fact, the only reason so much work was put into developing multi-core systems is that it became increasingly difficult to make faster and faster ones. So if you cannot have a 20 Ghz processor, then 8 x 3 Ghz ones will have to do.
HT is, I believe, primarily intended as an advantage in contexts where each thread is not necessarily gobbling as much processor as it can; it's doing some particular task that's governed by interaction with a user, such as CAD stuff, video games, etc; these are the kind of applications that benefit from multi-tasking. By contrast, server platforms -- wherein the primary applications tend to thread independent tasks that are not governed by a dependence on anything else, hence are optimally run as fast as possible -- do not benefit directly from multi-tasking; they benefit from speed. make is in the same category, although with a perhaps greater degree of interdependence between threads, which is why you see an advantage for HT from 4-8 threads.
1. This is a simplification. HT doesn't simply double the number of cores and halve their speed, but whatever dynamic is used, the total number of processor cycles per second for the system is not improved. It's the same -- only more fragmented.

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tunning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can be also seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quit a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How to I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well written matrix product kernel exhibits an excellent instruction level parallelism and thus exploits nearly 100% of the CPU ressources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components so if you measure it's rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (eg memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more that the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work, is likely unaware that it is doing so, relying on merely the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide, is that you can max out all physical processors, without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads, they are not additional processors.

How can warps in the same block diverge

I am a bit confused how it is possible that Warps diverge and need to be synchronized via __syncthreads() function. All elements in a Block handle the same code in a SIMT fashion. How could it be that they are not in sync? Is it related to the scheduler? Do the different warps get different computing times? And why is there an overhead when using __syncthreads()?
Lets say we have 12 different Warps in a block 3 of them have finished their work. So now there are idling and the other warps get their computation time. Or do they still get computation time to do the __syncthreads() function?
First let's be careful with terminology. Warp divergence refers to threads within a single warp that take different execution paths, due to control structures in the code (if, while, etc.) Your question really has to do with warps and warp scheduling.
Although the SIMT model might suggest that all threads execute in lockstep, this is not the case. First of all, threads within different blocks are completely independent. They may execute in any order with respect to each other. For your question about threads within the same block, let's first observe that a block can have up to 1024 (or perhaps more) threads, but today's SM's (SM or SMX is the "engine" inside the GPU that processes a threadblock) don't have 1024 cuda cores, so it's not even theoretically possible for an SM to execute all threads of a threadblock in lockstep. Note that a single threadblock executes on a single SM, not across all (or more than one) SMs simultaneously. So even if a machine has 512 or more total cuda cores, they cannot all be used to handle the threads of a single threadblock, because a single threadblock executes on a single SM. (One reason for this is so that SM-specific resources, like shared memory, can be accessible to all threads within a threadblock.)
So what happens? It turns out each SM has a warp scheduler. A warp is nothing more than a collection of 32 threads that gets grouped together, scheduled together, and executed together. If a threadblock has 1024 threads then it has 32 warps of 32 threads per warp. Now, for example, on Fermi, an SM has 32 CUDA cores, so it is reasonable to think about an SM executing a warp in lockstep (and that is what happens, on Fermi). By lockstep, I mean that (ignoring the case of warp divergence, and also certain aspects of instruction-level-parallelism, I'm trying to keep the explanation simple here...) no instruction in the warp is executed until the previous instruction has been executed by all threads in the warp. So a Fermi SM can only actually be executing one of the warps in a threadblock at any given instant. All other warps in that threadblock are queued up, ready to go, waiting.
Now, when the execution of a warp hits a stall for any reason, the warp scheduler is free to move that warp out and bring another ready-to-go warp in (this new warp might not even be from the same threadblock, but I digress.) Hopefully by now you can see that if a threadblock has more than 32 threads in it, not all the threads are actually getting executed in lockstep. Some warps are proceeding ahead of other warps.
This behavior is normally desirable, except when it isn't. There are times when you do not want any thread in the threadblock to proceed beyond a certain point, until a condition is met. This is what __syncthreads() is for. For example, you might be copying data from global to shared memory, and you don't want any of the threadblock data processing to commence until shared memory has been properly populated. __syncthreads() ensures that all threads have had a chance to copy their data element(s) before any thread can proceed beyond the barrier and presumably begin computations on the data that is now resident in shared memory.
The overhead with __syncthreads() is in two flavors. First of all there's a very small cost just to process the machine level instructions associated with this built-in function. Second, __syncthreads() will normally have the effect of forcing the warp scheduler and SM to shuffle through all the warps in the threadblock, until each warp has met the barrier. If this is useful, great. But if it's not needed, then you're spending time doing something that isn't needed. So thus the advice to not just liberally sprinkle __syncthreads() through your code. Use it sparingly and where needed. If you can craft an algorithm that doesn't use it as much as another, that algorithm may be better (faster).

Does modern GPU (e.g Fermi/Evergreen) supports out of order execution?

I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on CPU (8-core FX8150) and the result shows these barriers reduced running speed by a factor of 50~100 times (I further verified this by re-implementing the kernel on Java using multi-threading + CyclicBarrier). I suspect the reason was barrier essentially stops the CPU taking advantage of out-of-order execution, so I am a little worried if I would observe the same magnitude of speed decrease on GPU. I checked a few official documents and googled around a bit but there is little information available on this topic.
Current state-of-the art GPUs are in-order pipelined processor. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparisons, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units like ALUs and SFUs which have separated pipelines. But notice that instruction dependency stalls the warp. For more information on instruction dependency resolving on GPUs refer to this NVIDIA patent.
NVIDIA’s Next Generation
CUDA Compute and Graphics Architecture, Code-Named “Fermi”:
Nvidia GigaThread Engine has capabilities of(at page 5)
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
Evergreen has SIMD capabilities and has a chance outperform some fermi but i dont know about oooe of it. There is also "local atomic add" upper hand of HD 7000 series compared to GTX 600 series (nearly 10x faster)

Resources