Dividing up CUDA cudaMemcpy into chunks


A co-worker and I were brainstorming on how to mitigate the memory transfer time between host and device, and it came up that perhaps arranging things into one mega-transfer (i.e. one single call) might help. This led me to create a test case where I timed transferring a few large data chunks vs. many small data chunks. I got some very interesting/strange results, and was wondering if anyone here had an explanation?
I won't put my whole code up here since it's quite long, but I tested the chunking in two different ways:
Explicitly writing out all cudaMemcpy's, e.g.:
cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 1*nBytes/10, a + 1*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 2*nBytes/10, a + 2*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 3*nBytes/10, a + 3*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 4*nBytes/10, a + 4*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 5*nBytes/10, a + 5*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 6*nBytes/10, a + 6*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 7*nBytes/10, a + 7*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 8*nBytes/10, a + 8*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 9*nBytes/10, a + 9*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
Putting the cudaMemcpy's into a for loop:
cudaEventRecord(start, 0);
for(int i = 0; i < nChunks; i++)
{
cudaMemcpy(aD + i*nBytes/nChunks, a + i*nBytes/nChunks, nBytes/nChunks,
cudaMemcpyHostToDevice);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
To note, I also did a "warm-up" transfer at the start of each test just in case, though I don't think it was needed (the context was created by a cudaMalloc call).
I tested this on total transfer sizes ranging from 1 MB to 1 GB, where each test case transferred the same amount of information regardless of how it was chunked up. A sample of my output is this:
single large transfer = 0.451616 ms
10 explicit transfers = 0.198016 ms
100 explicit transfers = 0.691712 ms
10 looped transfers = 0.174848 ms
100 looped transfers = 0.683744 ms
1000 looped transfers = 6.145792 ms
10000 looped transfers = 104.981247 ms
100000 looped transfers = 13097.441406 ms
What's interesting here and what I don't get is that, across the board, the 10 transfers were ALWAYS faster by a significant amount than any of the others, even the single large transfer! And that result stayed consistent no matter how large or small the data set was (i.e. 10x100MB vs 1x1GB or 10x1MB vs 1x10MB still results in the 10x being faster). If anyone has any insight on why this is or what I may be doing wrong to get these weird numbers, I would be very interested to hear what you have to say.
Thanks!
P.S. I know that cudaMemcpy carries with it an implicit synchronization, so I could have used a CPU timer and the cudaEventSynchronize is redundant, but I figured it was better to be on the safe side.
UPDATE: I wrote a function to try and take advantage of this apparent rip in the performance space-time continuum. When I use that function, though, which is written EXACTLY as in my test cases, the effect goes away and I see what I expect (a single cudaMemcpy is fastest). Perhaps this is all more akin to quantum physics than relativity, wherein the act of observing changes the behavior...

cudaMemcpy() is synchronous - CUDA waits until the memcpy is done before returning to your app.
If you call cudaMemcpyAsync(), the driver will return control to your app before the GPU necessarily has performed the memcpy.
It's critical that you call cudaMemcpyAsync() instead of cudaMemcpy(). Not because you want to overlap the transfers with GPU processing, but because that is the only way you will get CPU/GPU concurrency.
On a cg1.4xlarge instance in Amazon EC2, it takes ~4 microseconds for the driver to request a memcpy of the GPU, so CPU/GPU concurrency is a good way to hide driver overhead.
I don't have a ready explanation for the disparity you are seeing at 10 - the main knee I'd expect to see is where the memcpy crosses over 64K in size. The driver inlines memcpy's smaller than 64K into the same buffer used to submit commands.

Use cudaThreadSynchronize() (deprecated in later CUDA releases in favor of cudaDeviceSynchronize()) before and after each CUDA call to get the real memory transfer time. cudaMemcpy() is synchronous, but not necessarily with CPU execution; that depends on the function called.
CUDA function calls are synchronous with other CUDA function calls, such as other memory transfers or kernel executions; this is managed in a separate CUDA thread that is invisible to the CUDA developer. cudaMemcpyAsync() is asynchronous with respect to other CUDA calls, which is why the GPU memory segments being copied must not overlap with other concurrent memory transfers.
Are you sure that in this case cudaMemcpy(), which is synchronous in the CUDA execution thread, is also synchronous with the CPU thread? Depending on the CUDA function it may or may not be, but if you call cudaThreadSynchronize() when measuring times, it will be synchronous with the CPU for sure, and the real time of each step will appear.

Perhaps it is some peculiarity in how CUDA measures time. You are measuring times which are less than 1 ms, which is very small.
Did you try to time it with a CPU-based timer and compare the results?
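A minimal sketch of what a CPU-based timer could look like, assuming a monotonic clock from std::chrono. The helper names are hypothetical, and std::memcpy is only a host-side stand-in for the chunked cudaMemcpy loop so the example runs without a GPU; in the real test you would wrap the cudaMemcpy loop instead.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Times any callable with a monotonic CPU clock and returns milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical stand-in for the chunked transfer: copies src into dst in
// n_chunks equal pieces. Assumes src.size() is divisible by n_chunks and
// that dst is at least as large as src.
double time_chunked_copy_ms(std::vector<char>& dst,
                            const std::vector<char>& src,
                            std::size_t n_chunks) {
    const std::size_t chunk = src.size() / n_chunks;
    return time_ms([&] {
        for (std::size_t i = 0; i < n_chunks; ++i)
            std::memcpy(dst.data() + i * chunk, src.data() + i * chunk, chunk);
    });
}
```

Comparing this number against the cudaEventElapsedTime result for the same loop would show whether the event timings are the odd ones out.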

Related

How to properly implement waiting of async computations?

I have a small problem and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while(true) {
async_enqueue_upload(input); // completes instantly, but transfer will take 10us
async_enqueue_calculate(); // completes instantly, but computation will take 80us
async_enqueue_download(output); // completes instantly, but transfer will take 10us
sync_wait_finish(); // must wait while output is fully calculated, and there is no junk
input = process(output); // i cannot launch next step without doing it on the host.
}
I am asking about the sync_wait_finish() part. I must wait for all devices to finish so that I can combine all the results, process the data, and upload a new portion that is based on the previous computation step. I need to sync data between steps, so I can't parallelize the steps. I know this is not a very performant case. So let's proceed to the question.
I have 2 ways of checking completion within sync_wait_finish(). The first is to sleep the thread in a loop until completion is reported:
while( !is_completed() )
Sleep(1);
This performs very poorly: the actual calculation takes, say, 100us, while the minimum Windows scheduler timestep is 1ms, so throughput is an unacceptable 10x lower.
The second way is to check completion in an empty busy loop:
while( !is_completed() )
{} // do_nothing();
This gives 10x better computation performance, but it is also an unsuitable solution, because it fully occupies a CPU core with absolutely useless work. How can I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the calculation time is too long for an active spin-wait but too short compared to the scheduler timestep? A related subquestion: how can this be done on Linux?

Fortunately, I have succeeded in finding an answer on my own. In short: I should use Linux for that.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int(__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) = (int(__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows sleep intervals to be set in units of 100ns. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc(); // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
auto s1 = qpc();
n++;
auto passed = s1 - s0;
if (passed >= hpf) {
std::cout << "freq=" << (n * hpf / passed) << " hz\n";
s0 = s1;
n = 0;
}
}
That yields a loop rate of just under 2000 Hz, and the result varies from line to line. That led me to the Windows thread scheduler, which is not at all suited to real-time tasks, and to its minimum interval of 0.5ms (plus overhead). By the way, does anyone know how to tune that value?
Next was the Linux question: what can it do? So I built a tiny custom 4.14 kernel using buildroot and ran the benchmark code there. I made qpc() return clock_gettime() data from the CLOCK_MONOTONIC clock, made qpf() return the number of nanoseconds in a second, and made sleep_precise() call clock_nanosleep(). I failed to find out what the difference between CLOCK_MONOTONIC and CLOCK_REALTIME is.
And I was quite surprised to get a whopping 18.4 kHz right out of the box, and it was quite stable. Testing several intervals, I found that I can set the loop to almost any frequency up to 18.4 kHz, but also that the actual measured wait time is about 1.6 times what I asked for. For example, if I ask to sleep 100 us it actually sleeps for ~160 us, giving a ~6.25 kHz frequency. Nothing else is running on the system, just the kernel, busybox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?
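For anyone who wants to repeat the overshoot measurement without platform-specific calls, here is a portable sketch using std::chrono and std::this_thread::sleep_for. It is not the original benchmark: the function name is hypothetical, and it relies on std::chrono::steady_clock, which is monotonic (on Linux it is typically backed by CLOCK_MONOTONIC, the clock that cannot be set, as opposed to CLOCK_REALTIME, which follows wall-clock time and can jump when the system time is adjusted).

```cpp
#include <chrono>
#include <thread>

// Returns the average actual duration, in microseconds, of `iterations`
// back-to-back sleeps that each request `requested_us` microseconds.
double average_sleep_us(long requested_us, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < iterations; ++i)
        std::this_thread::sleep_for(std::chrono::microseconds(requested_us));
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iterations;
}
```

Dividing the returned value by requested_us gives the overshoot factor for a given interval, which can be compared against the ~1.6x reported above.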

Slow sorting using Thrust, CUDA

I am a newbie to CUDA. I simply tried to sort an array using Thrust.
clock_t start_time = clock();
thrust::host_vector<int> h_vec(10);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
thrust::device_vector<int> d_vec = h_vec;
thrust::sort(d_vec.begin(), d_vec.end());
//thrust::sort(h_vec.begin(), h_vec.end());
clock_t stop_time = clock();
printf("%f\n", (double)(stop_time - start_time) / CLOCKS_PER_SEC);
Sorting d_vec takes 7.4s, and sorting h_vec takes 0.4s.
I assumed it is a parallel computation on device memory, so shouldn't it be faster?

Probably the main problem is context creation time: the first CUDA call initializes the CUDA context, which takes some time. Therefore you should start measuring time only after the first CUDA call.
In general you can only expect speed-up with GPU code compared to CPU code if the degree of parallelism is high enough. The vector size of 10 as in the example code is definitely too small to achieve speed-up. With a vector size >> 10000 you can expect to fully utilize a modern GPU.
You should also think about measuring only the time for sorting without the copy d_vec = h_vec, since often you will work with the device vector in the next step. Then you can consider the copy operation as a one time setup cost. (However if sorting is the only operation on device it is of course reasonable to include the memcopy in the measurement.)

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2)
__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) {
unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
if (ix < nx && iy < ny) {
out[iy*nx + ix] = in[ix*ny + iy]; //
// out[ix*ny + iy] = in[iy*nx + ix]; // for by row
}
}
This is what I don't understand: The load throughput for transposeNaiveCol() is 642.33 GB/s and for transposeNaiveRow() is 129.05 GB/s. The author says:
The results show that the highest load throughput is obtained with
cached, strided reads. In the case of cached reads, each memory
request is serviced with a 128-byte cache line. Reading data by
columns causes each memory request in a warp to replay 32 times
(because the stride is 2048 data elements), resulting in good latency
hiding from many in-flight global memory reads and then excellent L1
cache hit ratios once bytes are pre-fetched into L1 cache.
My question:
I thought that aligned/coalesced reads were ideal, but here it seems that strided reads improve performance.
Why is reading a cache line conducive to reduced performance in
this case?
Aren't replays in general a bad thing? It mentions here that it results in "good latency hiding".

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric that really matters is the wall clock time that your kernel takes to completion, of which the authors make no mention.
That being said, kernels usually fall into two categories:
Compute bound kernels, whose performance can be increased by trying to hide instruction latency: keep the pipeline full (maximize ILP).
I/O bound kernels, whose performance can be increased by trying to hide memory latency: keep data in flight (maximize bandwidth).
Matrix transpose being of very low compute intensity, it is therefore I/O bound, and as such to get better performance you should try to increase bandwidth usage.
Why is the column transpose better at maximizing bandwidth usage?
In the case of the row transpose, reads are coalesced: a single 128-byte transaction is served per warp, that is 4 bytes per thread. Those 128 bytes are put in cache but are never reused, so the cache is effectively of no use in this case.
In the case of the column transpose, reads are not coalesced: each warp gets served 32 transactions of 128 bytes, all of which will get into L1 and will be reused for the next 31 replays (assuming they didn't get kicked out of cache). That is very low load efficiency for very high effective load throughput, and maximal cache usage.
You could of course get the same effect in the row transpose by simply requesting more data per thread (for example by loading 32 floats, or 8 float4s, per thread) or by using CUDA's prefetch capabilities.
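The 1 versus 32 transaction counts above can be checked with a little host-side arithmetic. This is only an illustrative sketch: it assumes a warp of 32 threads, 4-byte floats, 128-byte cache lines and an aligned base address, and the function name is made up for the example.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct 128-byte lines touched when each of the 32 threads in a
// warp loads one 4-byte float, with consecutive threads `stride_elems` floats
// apart, starting from a 128-byte-aligned base address.
std::size_t lines_touched_by_warp(std::size_t stride_elems) {
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kLineBytes = 128;
    constexpr std::size_t kElemBytes = 4;  // assumed size of a float
    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t byte_offset = lane * stride_elems * kElemBytes;
        lines.insert(byte_offset / kLineBytes);  // index of the line holding this element
    }
    return lines.size();
}
```

A stride of 1 (row reads) touches a single line, while a stride of 2048 elements (column reads in the book's example) touches 32 different lines, each holding 32 floats of which only one is used by the current request.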

Why is this simple OpenCL kernel running so slowly?

I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
const __global char* pSrc,
__global __write_only char* pDst,
int length)
{
const int tid = get_global_id(0);
if(tid < length) {
pDst[tid] = pSrc[tid];
}
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Ivy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!

I think your main problem is the 2048*2048 work groups you are using. The OpenCL drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. See the question "What should this size be?" I have used 64 below as an example; 64 happens to be a decent number on most hardware.
::size_t myOptimalGroupSize = 64; // plain size_t; cl::size_t is a class template in the C++ bindings
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(myOptimalGroupSize),
NULL,
&event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
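To make the difference in scheduling overhead concrete, here is a small host-side sketch of the arithmetic. The helper names are hypothetical. The rounding helper is only needed when the item count is not a multiple of the group size, since OpenCL 1.x requires the global size to be divisible by the local size; the `if(tid < length)` guard already in the kernel makes that padding safe.

```cpp
#include <cstddef>

// Number of work groups the runtime has to schedule for a 1-D range.
std::size_t work_group_count(std::size_t global_size, std::size_t local_size) {
    return (global_size + local_size - 1) / local_size;
}

// Smallest multiple of local_size that covers n_items work items.
std::size_t round_up_global_size(std::size_t n_items, std::size_t local_size) {
    return work_group_count(n_items, local_size) * local_size;
}
```

For the 2048*2048 buffer in the question, a local size of 1 means 4194304 work groups, while a local size of 64 brings that down to 65536.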

CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my little experience with OpenCL on CPU, I have never reached performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel on a CPU would be to divide the block to copy into a small number of large sub-blocks, and let each thread copy one sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on CPU with OpenCL, use a loop inside your kernel to copy contiguous data. And then use a workgroup size not larger than the number of available cores.
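As a rough illustration of the CPU-style decomposition described above (a few large contiguous sub-blocks, one per thread), here is a sketch in plain C++ with an OpenMP pragma rather than OpenCL. The function name is made up, and the code assumes non-overlapping buffers; if OpenMP is not enabled at compile time the pragma is ignored and the loop simply runs serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Copies n_bytes from src to dst as n_blocks contiguous sub-blocks.
// With OpenMP enabled, each sub-block can be copied by a different thread.
void blocked_copy(char* dst, const char* src, std::size_t n_bytes, int n_blocks) {
    if (n_blocks < 1) n_blocks = 1;
    // Sub-block size, rounded up so the last block absorbs the remainder.
    const std::size_t block = (n_bytes + n_blocks - 1) / n_blocks;
    #pragma omp parallel for
    for (int b = 0; b < n_blocks; ++b) {
        const std::size_t begin = std::min(n_bytes, static_cast<std::size_t>(b) * block);
        const std::size_t len = std::min(block, n_bytes - begin);
        if (len > 0) std::memcpy(dst + begin, src + begin, len);
    }
}
```

Choosing n_blocks close to the number of available cores matches the advice above; the same loop-per-work-item structure is what the in-kernel loop would look like in OpenCL.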

I believe it is the cl::NDRange(1) which is telling the runtime to use single item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; there should be a way to do that in the C++ API as well (perhaps also just NULL). This should be faster on the CPU; it certainly will be on a GPU.

Why is file I/O in large chunks SLOWER than in small chunks?

If you call ReadFile once with something like 32 MB as the size, it takes noticeably longer than if you read the equivalent number of bytes with a smaller chunk size, like 32 KB.
Why?
(No, my disk is not busy.)
Edit 1:
Forgot to mention -- I'm doing this with FILE_FLAG_NO_BUFFERING!
Edit 2:
Weird...
I don't have access to my old machine anymore (PATA), but when I tested it there, it took around 2 times as long, sometimes more. On my new machine (SATA), I'm only getting a ~25% difference.
Here's a piece of code to test:
#include <memory.h>
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
int main()
{
HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
OPEN_EXISTING, FILE_FLAG_NO_BUFFERING /*(redundant)*/, NULL);
__try
{
const size_t chunkSize = 64 * 1024;
const size_t bufferSize = 32 * 1024 * 1024;
void *pBuffer = malloc(bufferSize);
DWORD start = GetTickCount();
ULONGLONG totalRead = 0;
OVERLAPPED overlapped = { 0 };
DWORD nr = 0;
ReadFile(hFile, pBuffer, bufferSize, &nr, &overlapped);
totalRead += nr;
_tprintf(_T("Large read: %lu for %llu bytes\n"), // DWORD ticks, ULONGLONG byte count
GetTickCount() - start, totalRead);
totalRead = 0;
start = GetTickCount();
overlapped.Offset = 0;
for (size_t j = 0; j < bufferSize / chunkSize; j++)
{
DWORD nr = 0;
ReadFile(hFile, pBuffer, chunkSize, &nr, &overlapped);
totalRead += nr;
overlapped.Offset += chunkSize;
}
_tprintf(_T("Small reads: %lu for %llu bytes\n"), // DWORD ticks, ULONGLONG byte count
GetTickCount() - start, totalRead);
fflush(stdout);
}
__finally { CloseHandle(hFile); }
return 0;
}
Result:
Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes
Any ideas?

Your test includes the time it takes to read in file metadata, specifically the mapping of file data to disk. If you close the file handle and re-open it, you should get similar timings for each. I tested this locally to make sure.
The effect is probably more severe with heavy fragmentation, as you have to read in more file-to-disk mappings.
EDIT: To be clear, I ran this change locally, and saw nearly identical times with large and small reads. Reusing the same file handle, I saw similar timings from the original question.

This is not specific to Windows. I did some tests a while back with the C++ iostream library and found there was an optimum buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests, and I can't remember what the size was :-). As to why, well, there are a lot of issues, such as a large buffer possibly causing paging in other applications running at the same time (as the buffer can't be paged).

When you perform the 1024 * 32KB reads, are you reading into the same memory block over and over, or are you allocating a total of 32MB to read into as well and filling the entire 32MB?
If you're reading the smaller reads into the same 32K block of memory, then the time difference is probably simply that Windows doesn't have to scavenge up the additional memory.
Update based on the FILE_FLAG_NO_BUFFERING addition to the question:
I'm not 100% certain, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows will lock the buffer into physical memory so it can allow the device driver to deal with physical addresses (such as to DMA directly into the buffer). It could (I believe) do this by breaking up a large request into smaller requests, but I suspect that Microsoft might have the philosophy that "if you ask for FILE_FLAG_NO_BUFFERING then we assume you know what you're doing and we're not going to get in your way".
Of course locking 32MB all at once instead of 32KB at a time will require more resources. So this would be kind of like my initial guess, but at the physical memory level rather than the virtual memory level.
However, since I don't work for MS and don't have access to Windows source, I'm going by vague recollection from times when I worked closer with the Windows kernel and device driver model (so this is more or less speculation).

When you use FILE_FLAG_NO_BUFFERING, the operating system does not buffer the I/O, so each call to the read function makes a system call that fetches the data from the disk. To read a file of a fixed size with a smaller buffer, more system calls are needed, which means more user-space to kernel-space switches, and a disk I/O is initiated each time. With a larger block size, fewer system calls are required for the same file size, so there are fewer user-to-kernel switches and fewer disk I/O operations. This is why a larger block generally takes less time to read.
Try reading the file only 1 byte at a time without buffering, then try a 4096-byte block, and see the difference.

A possible explanation in my opinion would be command queueing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at low level.
A single large request will of course still necessarily be broken into sub-requests, but those will likely be sent more or less one after another (because the driver needs to lock the pages and will in all likelihood be reluctant to lock several megabytes lest it hits the quota).
On the other hand, if you throw a dozen or two dozen requests at the driver, it will just forward them to the disk, and the disk can take advantage of NCQ.
Well, that's what I'm thinking might be the reason anyway (this does not explain why the exact same phenomenon happens with buffered reads though, as in the Q that I linked to above).

What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third read while the second is being processed, etc., so that the speed limit is the slower of the physical read time or the processing time. If it takes the same amount of time to process one block as to read the next, the speed could be double what it would be if processing and reading were separate. When using larger blocks, the amount of data that is read while the first block is being processed will be limited to an amount smaller than the block size. When the code is ready for the next block of data, part of it will have been read but some of it will not; it will thus be necessary for the code to wait while the remainder of the data is fetched.
