optimal number of CUDA parallel blocks - parallel-processing

Can there be any performance advantage to launch a grid of blocks simultaneously over launching blocks one at a time if the number of threads in each block is already larger than the number of CUDA cores?

I think there is; A thread block is assigned to a Streaming Multiprocessor (SM) and the SM further divides the threads of each block into warps of 32 threads (newer architectures can handle larger warps) that are scheduled to be executed (more-less) sequentially. Considering this, it will be faster to break each computation into blocks so that they occupy as many SMs as possible. It is also meaning full to build blocks that are multiples of the threads per warp that the card supports (a block of 32 or 64 threads rather than 40 threads, for the case that SMs use 32-thread warps).

Launch Latency
Launch latency (API call to work is started on the GPU) is of a grid is 3-8 µs on Linux to
30-80 µs on Windows Vista/Win7.
Distributing a block to a SM is 10-100s ns.
Launching a warp in a block (32 threads) is a few cycles and happens in parallel on each SM.
Resource Limitations
Concurrent Kernels
- Tesla N/A only 1 grid at a time
- Fermi 16 grids at a time
- Kepler 16 grids (Kepler2 32 grids)
Maximum Blocks (not considering occupancy limitations)
- Tesla SmCount * 8 (gtx280 = 30 * 8 = 240)
- Fermi SmCount * 16 (gf100 = 16 * 16 = 256)
- Kepler SmCount * 16 (gk104 = 8 * 16 = 128)
See occupancy calculator for limitations on threads per block, threads per SM, registers per SM, registers per thread, ...
Warps Scheduling and CUDA Cores
CUDA cores are floating point/ALU units. Each SM has other types of execution units including load/store, special function, branch, etc. A CUDA core is equivalent to a SIMD unit in a x86 processor. It is not equivalent to a x86 core.
Occupancy is the measure of warps per SM to the maximum number of warps per SM. The more warps per SM the higher the chance that the warp scheduler has an eligible warp to schedule. However, the higher the occupancy the less resources will be available per thread. As a basic goal you want to target more than
25% 8 warps on Tesla
50% or 24 warps on Fermi
50% or 32 warps on Kepler (generally higher)
You'll notice there is no real relationship to CUDA cores in these calculations.
To understand this better read the Fermi whitepaper and if you can use the Nsight Visual Studio Edition CUDA Profiler look at the Issue Efficiency Experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.

Related

Theoretical maximum performance (FLOPS ) of Intel Xeon E5-2640 v4 CPU, using only addition?

I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Boardwell-based). In this post, >800GFLOPS; in this post, about 200GFLOPS; in this post, 3.69GFLOPS per core, 147.70GFLOPS per computer. So what is the theoretical maximum performance of Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/32 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)? What if additions and multiplications do not appear at the same time? (eg. if the workload only contains additions, is *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes you're missing a factor of two because FMA counts as two FP operations, and that's what you need to do to achieve the hardware's theoretical max FLOPS. FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 . (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition then only have one FLOP per SIMD element per instruction, and also only 1/clock FP instruction throughput on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be addition.
FMA counts as 2 FLOPs per element, MUL/ADD only count as 1 each.
on up to 32 byte wide inputs (e.g. 4 doubles or 8 floats)

Relation Between CUDA Blocks and Threads and SMPs

I recently read this CUDA tutorial: https://developer.nvidia.com/blog/even-easier-introduction-cuda/ and one thing was unclear. When we sum two vectors we divide the task into several blocks and threads to do this in parallel. My question is why then the number of blocks (and maybe threads) doesn't depend on physical properties of GPU, the number of physical SMPs and threads?
For example let's say GPU has 16 SMPs and each of them can run 128 threads, will it be faster to split the problem into 16 blocks by 128 threads or, like in the article, split by 4000 blocks with 256 threads?
It does not depend because the number of threads will depend mainly on your problem size and the block size will depend on your GPU architecture. For example, if your GPU has 3000 cores and can have blocks of a maximum of 512, and your code will process a matrix with a size of 2 billion, you will have to specify the "number of blocks X number of threads per block(which is not greater than 512)" that will be EQUAL or GREATER than 2 billion, then CUDA will smartly partition your blocks of threads into your 3000 CUDA cores of your GPU until all of the threads specified by the "numBLocks X numThreadsPerBlock" have been called by the GPU.

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPU and I have achieved a speedup of 30x. That is, an efficiency equal to 0.93 (efficiency = speedup/cores, i.e. 0.93 = 30/32).
Later I added 2 GPUs (Tesla C2075 of 448 cores each) together to the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the amount of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency has been calculated based on what has been said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs have bigger "core complex" architectures called "SM" or "CU" with tens of pipelines each. Not "very" similar to "SIMD" of a CPU, they can issue commands in parallel to these pipelines in a "single-threaded" kernel code.
You have counted "cores" in CPU and not SIMD pipelines (which is 4 to 16 times of number of cores) so, it wouldn't be wrong to count SM units of Nvidia or CU of Amd or Slice subset of Intel etc.
Tesla C2075 has 14 SM units so you could add 14 for each GPU (32+14+14).
If you have also used SIMDified code for CPU, then it wouldn't be wrong to count each pipeline of a GPU which is 32 to 192 times the number of SM/CU(like 448 per GPU of yours) (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from GPU is not a bottleneck, efficiency should not drop much after GPUs are added.

How can CPU's have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches are also using those streams. The program is doing double-precision computations (and I suspect this is the culprit, however, cuBlas reaches 75-85% CPU usage for multiplication of matrices of doubles). There are also reduction operations, however they are implemented via if(threadIdx.x < s) with s decreasing 2 times in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive, it starts with another piece of work as soon as the previous has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% of GPU load (and about 50% of CPU load), if trusting GPU-Z 1.9.0. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is for the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One my guess is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance" and the power supply is 700W while power consumption with max CPU and GPU TDP is about 550W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there are 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z takes as 100% the situation when all single-precision and double-precision cores are used. However, the numbers do not quite match.
Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So to be able to launch all 1536 threads in 560 Ti, for block size 256 the following can be specified:
_global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, the GPU usage has raised to 60% for me.
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU
thread is waiting on GPU operation, rather than spinning GPU for the
result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel
launches themselves asynchronous and avoids excessive local memory
allocations&deallocations.

Resources