CUDA - Blocks and Threads - parallel-processing

I have implemented a string matching algorithm on the GPU. The searching time of a parallel version has been decreased considerably compared with the sequential version of the algorithm, but by using different number of blocks and threads I get different results.
How can I determine the number of the blocks and threds to get the best results?

I think this question is hard, if not impossible, to answer for the reason that it really depends on the algorithm and how it is operating. Since i cant see your implementation i can give you some leads:
Don't use global memory & check how you can max out the use of shared memory. Generally get a good feel of how threads access memory and how data is retrieved etc.
Understand how your warps operate. Sometimes threads in a warp may wait for other threads to finish in case you have 1 to 1 mapping between thread and data. So instead of this 1 to 1 mapping, you can map threads to multiple data so that they are kept busy.
Since blocks consist of threads that are group in 32 threads warp, it is the best if the number of threads in a block is a multiple of 32, so that you dont get warps consisting of 3 threads etc.
Avoid Diverging paths in warps.
I hope it helps a bit.

#Chris points are very important too but depend more on the algorithm itself.
Check the Cuda Manual about Thread alignment regarding memory lookups. Shared Memory Arrays should also be size of multiple of 16.
Use Coalesced global memory reads. But by algorithm design this is often the case and using shared memory helps.
Don't use atomic operations in global memory or at all if possible. They are very slow. Some algorithms using atomic operations can be rewritten using different techniques.
Without shown code no-one can tell you what is the best or why performance changes.
The number of threads per block of your kernel is the most important value.
Important values to calculate that value are:
Maximum number of resident threads per multiprocessor
Maximum number of resident blocks per multiprocessor
Maximum number of threads per block
Number of 32-bit registers per multiprocessor
Your algorithms should be scalable across all GPU's reaching 100% occupancy. For this I created myself a helper class which automatically detects the best thread numbers for the used GPU and passes it to the Kernel as a DEFINE.
/**
* Number of Threads in a Block
*
* Maximum number of resident blocks per multiprocessor : 8
*
* ///////////////////
* Compute capability:
* ///////////////////
*
* Cuda [1.0 - 1.1] =
* Maximum number of resident threads per multiprocessor 768
* Optimal Usage: 768 / 8 = 96
* Cuda [1.2 - 1.3] =
* Maximum number of resident threads per multiprocessor 1024
* Optimal Usage: 1024 / 8 = 128
* Cuda [2.x] =
* Maximum number of resident threads per multiprocessor 1536
* Optimal Usage: 1536 / 8 = 192
*/
public static int BLOCK_SIZE_DEF = 96;
Example Cuda 1.1 to reach 786 resident Threads per SM
8 Blocks * 96 Threads per Block = 786 threads
3 Blocks * 256 Threads per Block = 786 threads
1 Blocks * 512 Threads per Block = 512 threads <- 33% of GPU will be idle
This is also mentioned in the book:
Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
Good programming advices:
Analyse your kernel code and write down the maximal number of threads it can handle or how many "units" it can process.
Also output your register usage and try to lower it to the respective targeted CUDA version. Because if you use too many registers in your kernel less blocks will be executed resulting in less occupancy and performance.
Example: Using Cuda 1.1 and using optimal number of 768 resident threads per SM you have 8192 registers to use. This leads to 8192 / 768 = 10 maximum registers per thread/kernel. If you use 11 the GPU will use 1 Block less resulting in decreased performance.
Example: A matrix independent row vector normalizing kernel of mine.
/*
* ////////////////////////
* // Compute capability //
* ////////////////////////
*
* Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
* Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxregcount 10 Limit for Cuda 1.1
* I: Maximum number of Rows = max(x-dim)^max(dimGrid)
* II: Maximum number of Columns = unlimited, since they are loaded in a tile loop
*
* Cuda [1.0 - 1.3]:
* I: 65535^2 = 4.294.836.225
*
* Cuda [2.0]:
* II: 65535^3 = 281.462.092.005.375
*/

Related

Relation Between CUDA Blocks and Threads and SMPs

I recently read this CUDA tutorial: https://developer.nvidia.com/blog/even-easier-introduction-cuda/ and one thing was unclear. When we sum two vectors we divide the task into several blocks and threads to do this in parallel. My question is why then the number of blocks (and maybe threads) doesn't depend on physical properties of GPU, the number of physical SMPs and threads?
For example let's say GPU has 16 SMPs and each of them can run 128 threads, will it be faster to split the problem into 16 blocks by 128 threads or, like in the article, split by 4000 blocks with 256 threads?
It does not depend because the number of threads will depend mainly on your problem size and the block size will depend on your GPU architecture. For example, if your GPU has 3000 cores and can have blocks of a maximum of 512, and your code will process a matrix with a size of 2 billion, you will have to specify the "number of blocks X number of threads per block(which is not greater than 512)" that will be EQUAL or GREATER than 2 billion, then CUDA will smartly partition your blocks of threads into your 3000 CUDA cores of your GPU until all of the threads specified by the "numBLocks X numThreadsPerBlock" have been called by the GPU.

Why is the CPU slower for calculations then the GPU when only Memory should matter?

A modern CPU has a ethash hashrate from under 1MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ) while GPUs mine with over 20MH/s easily. With overclocked memory they reach rates up to 30MH/s.
GPUs have GDDR Memory with Clockrates of about 1000MHz while DDR4 runs with higher clock speeds. Bandwith of DDR4 seems also to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said for Dagger-Hashimoto/ethash bandwith of memory is the thing that matters (also experienced from overclocking GPUs) which I find reasonable since the CPU/GPU only has to do 2x sha3 (1x Keccak256 + 1x Keccak512) operations (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M of Keccak512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ) so then core count difference between GPUs and CPUs should not be the problem.
But why don't we get about ~50Mhash/s from 2xKeccak operations and memory loading on a CPU?
See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores, that operate pretty much in lockstep, but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.
"A modern Skylake processor can compute over 100M of Keccak512 operations per second" is incorrect, it is 140 MiB/s. That is MiBs per second and a hash operation is more than 1 byte, you need to divide the 140 MiB/s by the number of bytes being hashed.
I found an article addressing my problem (the influence of Memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ) it's also the Memorybandwidth which would bottelneck the CPU.
As described in the article every round fetches 8kb of data for calculation. This results in the following formular:
(Memory Bandwidth) / ( DAG memory fetched per hash) = Max Theoreticical Hashrate
(Memory Bandwidth) / ( 8kb / hash) = Max Theoreticical Hashrate
For a grafics card like the RX470 mentioned this results in:
(211 Gigabytes / sec) / (8 kilobytes / hash) = ~26Mhashes/sec
While for CPUs with DDR4 this will result in:
(12.8GB / sec) / (8 kilobytes / hash) = ~1.6Mhashes/sec
or (debending on clock speeds of RAM)
(25.6GB / sec) / (8 kilobytes / hash) = ~3.2Mhashes/sec
To sum up, a CPU or also GPU with DDR4 ram could not get more than 3.2MHash/s since it can't get the data fast enough needed for processing.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/

How to interpret stats/show in Rebol 3

Wanted to make some profiling on a R3 script and was checking at the stats command.
But what do these informations mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776
It shows the internal memory management information, not sure how useful it would be to the script.
Anyway, here are some explanations about the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at Atronix source code), to make it simple, I will focus on series pools here.
Internally, a series has a header and its data which is a chunk of contiguous memory. The header has the width and length info about the series. The data holds the actual content of the series. In R3, Series is used extensively to implement block!, port!, string!, object!, etc. So managing memory in R3 is almost managing (allocating and destroying) series. Because of the difference in the width and length of serieses, pools are introduced to reduce the fragmentation.
When a new series is needed, the header is allocated in a special pool, and another pool is chosen for its data. The pool whose width is closed to the size of the series is chosen. E.g. a block with 3 elements will probably be allocated in a pool with width of 128-byte (on 32-bit systems, a block is a series with 4 (3 + 1 terminater) elements). As a pool could increase as the the program runs, it's implemented as a list of segments. New segments will be allocated and appended to the list as needed (but it's never released back to system).
Another special pool is the system pool, which is chosen when the required memory is big. R3 doesn't actually manage this pool other than collecting some statistics.
When it tries to collect garbage, it will sweep the root context, and mark everything that can be reachable, then it will go through the series header pool, and find out all unneeded serieses and destroy them.
If you use stats without a refinement, you can see the actual memory usage. So comparing memory usage before and after your implementations you can see which one uses less memory.
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064

Measuring effective bandwidth on CUDA

So I want to know how to calculate the total memory effective bandwidth for:
cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);
where that function belows to cublas_v2.h
That function runs in 0.46 ms, and the vectors are 10000 * sizeof(float)
Am I having ((10000 * 4) / 10^9 )/0.00046 = 0.086 GB/s?
I'm wondering about it because I don't know what is inside the cublasSdot function, and I don't know if it is necesary.
In your case, the size of the input data is 10000 * 4 * 2 since you have 2 input vectors, and the size of the output data is 4. The effective bandwidth should be about 0.172 GB/s.
Basically cublasSdot() does nothing much more than computing.
Profile result shows cublasSdot() invokes 2 kernels to compute the result. An extra 4-bytes device-to-host mem transfer is also invoked if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for cublas lib.
If kernel time is in ms then a multiplication factor of 1000 is necessary.
That results in 86 GB/s.
As an example refer to example provide by NVIDIA for Matrix Transpose
at http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
On Last Page entire code is present. The way the Effective Bandwidth is computed is 2.*1000*mem_size/(1024*1024*1024)/(Time in ms)

optimal number of CUDA parallel blocks

Can there be any performance advantage to launch a grid of blocks simultaneously over launching blocks one at a time if the number of threads in each block is already larger than the number of CUDA cores?
I think there is; A thread block is assigned to a Streaming Multiprocessor (SM) and the SM further divides the threads of each block into warps of 32 threads (newer architectures can handle larger warps) that are scheduled to be executed (more-less) sequentially. Considering this, it will be faster to break each computation into blocks so that they occupy as many SMs as possible. It is also meaning full to build blocks that are multiples of the threads per warp that the card supports (a block of 32 or 64 threads rather than 40 threads, for the case that SMs use 32-thread warps).
Launch Latency
Launch latency (API call to work is started on the GPU) is of a grid is 3-8 µs on Linux to
30-80 µs on Windows Vista/Win7.
Distributing a block to a SM is 10-100s ns.
Launching a warp in a block (32 threads) is a few cycles and happens in parallel on each SM.
Resource Limitations
Concurrent Kernels
- Tesla N/A only 1 grid at a time
- Fermi 16 grids at a time
- Kepler 16 grids (Kepler2 32 grids)
Maximum Blocks (not considering occupancy limitations)
- Tesla SmCount * 8 (gtx280 = 30 * 8 = 240)
- Fermi SmCount * 16 (gf100 = 16 * 16 = 256)
- Kepler SmCount * 16 (gk104 = 8 * 16 = 128)
See occupancy calculator for limitations on threads per block, threads per SM, registers per SM, registers per thread, ...
Warps Scheduling and CUDA Cores
CUDA cores are floating point/ALU units. Each SM has other types of execution units including load/store, special function, branch, etc. A CUDA core is equivalent to a SIMD unit in a x86 processor. It is not equivalent to a x86 core.
Occupancy is the measure of warps per SM to the maximum number of warps per SM. The more warps per SM the higher the chance that the warp scheduler has an eligible warp to schedule. However, the higher the occupancy the less resources will be available per thread. As a basic goal you want to target more than
25% 8 warps on Tesla
50% or 24 warps on Fermi
50% or 32 warps on Kepler (generally higher)
You'll notice there is no real relationship to CUDA cores in these calculations.
To understand this better read the Fermi whitepaper and if you can use the Nsight Visual Studio Edition CUDA Profiler look at the Issue Efficiency Experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.

Resources