Pytorch - multiple GPUs - parallel-processing

Pytorch - multiple GPUs - parallel-processing

I have a rather large network that requires lots of GPU memory. I have 4 GPUs and I use nn.DataParallel. When I define a batch size of 4 (1 input for each GPU) my GPUs run out of memory. I tried torch.cuda.empty_cache() and removing any unnecessary data from the GPUs, but it was not enough.
I am looking for a way to treat the 4 GPUs as 2 GPUs with double memory, so that I could use a batch size of 2 (1 input for two GPUs). Is there any way to do so?
Thanks.

Related

Relation Between CUDA Blocks and Threads and SMPs

I recently read this CUDA tutorial: https://developer.nvidia.com/blog/even-easier-introduction-cuda/ and one thing was unclear. When we sum two vectors we divide the task into several blocks and threads to do this in parallel. My question is why then the number of blocks (and maybe threads) doesn't depend on physical properties of GPU, the number of physical SMPs and threads?
For example let's say GPU has 16 SMPs and each of them can run 128 threads, will it be faster to split the problem into 16 blocks by 128 threads or, like in the article, split by 4000 blocks with 256 threads?

It does not depend because the number of threads will depend mainly on your problem size and the block size will depend on your GPU architecture. For example, if your GPU has 3000 cores and can have blocks of a maximum of 512, and your code will process a matrix with a size of 2 billion, you will have to specify the "number of blocks X number of threads per block(which is not greater than 512)" that will be EQUAL or GREATER than 2 billion, then CUDA will smartly partition your blocks of threads into your 3000 CUDA cores of your GPU until all of the threads specified by the "numBLocks X numThreadsPerBlock" have been called by the GPU.

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)

The only result I don't understand and can't explain is the one with 2
work items. I would expect the speed up to be about 2 times faster
compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. More work items means fewer memory accesses.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

Why is the CPU slower for calculations then the GPU when only Memory should matter?

A modern CPU has a ethash hashrate from under 1MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ) while GPUs mine with over 20MH/s easily. With overclocked memory they reach rates up to 30MH/s.
GPUs have GDDR Memory with Clockrates of about 1000MHz while DDR4 runs with higher clock speeds. Bandwith of DDR4 seems also to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said for Dagger-Hashimoto/ethash bandwith of memory is the thing that matters (also experienced from overclocking GPUs) which I find reasonable since the CPU/GPU only has to do 2x sha3 (1x Keccak256 + 1x Keccak512) operations (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M of Keccak512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ) so then core count difference between GPUs and CPUs should not be the problem.
But why don't we get about ~50Mhash/s from 2xKeccak operations and memory loading on a CPU?

See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores, that operate pretty much in lockstep, but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.

"A modern Skylake processor can compute over 100M of Keccak512 operations per second" is incorrect, it is 140 MiB/s. That is MiBs per second and a hash operation is more than 1 byte, you need to divide the 140 MiB/s by the number of bytes being hashed.

I found an article addressing my problem (the influence of Memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ) it's also the Memorybandwidth which would bottelneck the CPU.
As described in the article every round fetches 8kb of data for calculation. This results in the following formular:
(Memory Bandwidth) / ( DAG memory fetched per hash) = Max Theoreticical Hashrate
(Memory Bandwidth) / ( 8kb / hash) = Max Theoreticical Hashrate
For a grafics card like the RX470 mentioned this results in:
(211 Gigabytes / sec) / (8 kilobytes / hash) = ~26Mhashes/sec
While for CPUs with DDR4 this will result in:
(12.8GB / sec) / (8 kilobytes / hash) = ~1.6Mhashes/sec
or (debending on clock speeds of RAM)
(25.6GB / sec) / (8 kilobytes / hash) = ~3.2Mhashes/sec
To sum up, a CPU or also GPU with DDR4 ram could not get more than 3.2MHash/s since it can't get the data fast enough needed for processing.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches are also using those streams. The program is doing double-precision computations (and I suspect this is the culprit, however, cuBlas reaches 75-85% CPU usage for multiplication of matrices of doubles). There are also reduction operations, however they are implemented via if(threadIdx.x < s) with s decreasing 2 times in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive, it starts with another piece of work as soon as the previous has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% of GPU load (and about 50% of CPU load), if trusting GPU-Z 1.9.0. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is for the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One my guess is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance" and the power supply is 700W while power consumption with max CPU and GPU TDP is about 550W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there are 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z takes as 100% the situation when all single-precision and double-precision cores are used. However, the numbers do not quite match.

Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So to be able to launch all 1536 threads in 560 Ti, for block size 256 the following can be specified:
_global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, the GPU usage has raised to 60% for me.
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU
thread is waiting on GPU operation, rather than spinning GPU for the
result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel
launches themselves asynchronous and avoids excessive local memory
allocations&deallocations.

why is mpi slower on my laptop

I am running MPI on my laptop (intel i7 quad core 4700m 12Gb RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought that it should scale well up to 8 process (intel quad core simulates as 8???). For example consider the simple toy Fortran code:
program test
implicit none
integer, parameter :: root=0
integer :: ierr,rank,nproc,tt,i
integer :: n=100000
real :: s=0.0,tstart,tend
complex, dimension(100000/nproc) :: u=2.0,v=0.0
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
call cpu_time(tstart)
do tt=1,200000
v=0.0
do i=1,100000/nproc
v(i) = v(i) + 0.1*u(i)
enddo
enddo
call cpu_time(tend)
if (rank==root) then
print *, 'total time was: ',tend-tstart
endif
call MPI_FINALIZE(ierr)
end subroutine test2
For 2 processes it takes half the time, but even trying 4 processes (should be quarter of the time?) the result begins to become less efficient and for 8 processes there is no improvement whatsoever. Basically I am wondering if this is just because I am running on a laptop and has something to do with shared memory, or if I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.

A quad core processor, thanks to hyperthreading shows itself as having 8 threads, but physically they are just 4 cores. The other 4 are scheduled by the hardware itself using the free slots in the execution pipelines.
It happens that especially with compute intensive loads this approach does not pay at all, being often counter-productive too on extreme loads because of overheads and not always optimized cache usage.
You can try to disable hyperthreading in the BIOS and compare it: you will have just 4 threads, 4 cores.
Even going from 1 to 4 there are resources that are being in competition. In particular each core has its own L1 cache, but each pair of cores shares the L2 cache (2x256KB) and the 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect to have linear scaling occupying more and more cores, since they will have to balance the usage of the resources, that are dedicated to one core/one thread in the sequential case.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads, as the one in your test case.
For example it's less evident with matrix-matrix multiplies, that is compute-intensive: for a NxN matrix, you have O(N^2) memory accesses but O(N^3) floating point operations.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio