CUDA: Only one job to begin with - parallel-processing

Sorry for the bad title; I could not come up with anything better.
Every example I have seen of CUDA programs has predefined data that is ready to be parallelized.
A common example is the sum of two matrices where the two matrices are already filled. But what about programs that generate new tasks? How do I model this in CUDA? How do I pass a result on so that other threads can begin working on it?
For example:
Say I run a kernel on one job. This job generates 10 new independent jobs. Each of them generates 10 new independent jobs, and so on. This seems like a task that is highly parallel because each job is independent. The problem is I don't know how to model this in CUDA.
I tried doing it in CUDA by using a while loop in a kernel to keep polling whether a thread could begin its computation. Each thread was assigned a job, but that did not work: the kernel seemed to ignore the while loop.
Code example:
On host:
fill ready array with 0
ready[0] = 1;
On device:
__global__ void kernel(int *ready, int *result)
{
    int tid = threadIdx.x;
    if (tid < N)
    {
        int condition = ready[tid];
        while (condition != 1)
        {
            condition = ready[tid];
        }

        result[tid] = 3; // later do real computation

        // children jobs are now ready to work
        int childIndex = tid * 10;
        if (childIndex < (N - 10))
        {
            ready[childIndex + 1] = 1; ready[childIndex + 2] = 1;
            ready[childIndex + 3] = 1; ready[childIndex + 4] = 1;
            ready[childIndex + 5] = 1; ready[childIndex + 6] = 1;
            ready[childIndex + 7] = 1; ready[childIndex + 8] = 1;
            ready[childIndex + 9] = 1; ready[childIndex + 10] = 1;
        }
    }
}

You will want to use multiple kernel calls. Once a kernel has finished and generated the work units for its children, the children can be executed in another kernel launch. You don't want to poll with a while loop inside a CUDA kernel anyway; even if it worked, you would get terrible performance.
I would Google the CUDA parallel reduction example. It shows how to decompose a problem into multiple kernels; the only difference is that instead of doing less work between kernels, you will be doing more.
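A rough host-side sketch of that structure (a made-up example: process_level, the double-buffered job arrays, and the 10x fan-out are my assumptions, not code from the question). Each launch consumes one generation of jobs and writes the next generation, which the following launch then processes.

#include <cuda_runtime.h>
#include <utility>

// Hypothetical kernel: each thread handles one job of the current generation
// and emits its 10 child jobs for the next launch.
__global__ void process_level(const int *jobs_in, int *jobs_out, int count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count)
    {
        int result = jobs_in[tid] * 3;           // stand-in for the real computation
        for (int c = 0; c < 10; ++c)
            jobs_out[tid * 10 + c] = result + c; // the 10 new independent jobs
    }
}

// Host side: one kernel launch per generation, 10x more jobs each time.
// Assumes d_bufA and d_bufB were cudaMalloc'd large enough for the last generation.
void run_levels(int *d_bufA, int *d_bufB, int levels)
{
    int count = 1;
    for (int level = 0; level < levels; ++level)
    {
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        process_level<<<blocks, threads>>>(d_bufA, d_bufB, count);
        std::swap(d_bufA, d_bufB);               // children become the next launch's input
        count *= 10;
    }
    cudaDeviceSynchronize();                     // wait for the final generation to finish
}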

Seems like you can use CUDA Dynamic Parallelism.
With this you can invoke a kernel from inside another kernel, meaning that when the first kernel is done generating the 10 tasks, right before it finishes, it can launch the next kernel that will handle those tasks.
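A minimal sketch of that approach, assuming a device of compute capability 3.5 or newer and compilation with nvcc -rdc=true; the kernel names and the per-task work are placeholders, not code from the question:

// Hypothetical dynamic-parallelism sketch: a parent job launches its 10
// children directly from the device once its own work is done.
__global__ void child_kernel(int *results, int parent_id)
{
    int child = threadIdx.x;                              // 10 threads, one per child job
    results[parent_id * 10 + child] = parent_id + child;  // stand-in computation
}

__global__ void parent_kernel(int *results)
{
    int parent_id = blockIdx.x * blockDim.x + threadIdx.x;
    // ... the parent job's own computation goes here ...
    child_kernel<<<1, 10>>>(results, parent_id);          // hand the 10 new tasks to a child grid
}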

Related

Out of Order execution and loops

I've always had the same question in mind about instruction-level parallelism and loops:
How does a CPU parallelize loops? Does it execute multiple successive iterations at once, does it execute subsequent instructions that are independent of the loop, or both?
Consider the following function for reversing an array as an example:
// pretend that uint64_t is strict-aliasing safe,
// like it was defined with GNU C __attribute__((may_alias, aligned(4)))
void reverse_SSE2(uint32_t* arr, size_t length)
{
    __m128i* startPtr = (__m128i*)arr;
    __m128i* endPtr = (__m128i*)(arr + (length - sizeof(__m128i) / sizeof(uint32_t)));

    while (startPtr < (__m128i*)((uint32_t*)endPtr - (sizeof(__m128i) / sizeof(uint32_t) - 1)))
    {
        __m128i lo = _mm_loadu_si128(startPtr);
        __m128i hi = _mm_loadu_si128(endPtr);
        __m128i reverseLo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(0, 1, 2, 3));
        __m128i reverseHi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(0, 1, 2, 3));
        _mm_storeu_si128(startPtr++, reverseHi);
        _mm_storeu_si128(endPtr--, reverseLo);
    }

    uint64_t* startPtr64 = (uint64_t*)startPtr;
    uint64_t* endPtr64 = (uint64_t*)endPtr + 1;
    if (startPtr64 < (uint64_t*)((uint32_t*)endPtr64 - (sizeof(uint64_t) / sizeof(uint32_t) - 1)))
    {
        uint64_t lo = *startPtr64;
        uint64_t hi = *endPtr64;
        lo = math.rol(lo, 32);
        hi = math.ror(hi, 32);
        *startPtr64++ = hi;
        *endPtr64 = lo;

        uint32_t* startPtr32 = (uint32_t*)startPtr64;
        uint32_t* endPtr32 = (uint32_t*)math.max((uint64_t)((uint32_t*)endPtr64 - 1), (uint64_t)startPtr32);
        uint32_t lo32 = *startPtr32;
        uint32_t hi32 = *endPtr32;
        *startPtr32 = hi32;
        *endPtr32 = lo32;
    }
    else
    {
        uint32_t* startPtr32 = (uint32_t*)startPtr64;
        uint32_t* endPtr32 = (uint32_t*)endPtr64 + 1;
        while (endPtr32 > startPtr32)
        {
            uint32_t lo = *startPtr32;
            uint32_t hi = *endPtr32;
            *startPtr32++ = hi;
            *endPtr32-- = lo;
        }
    }
}
This code could be rewritten for maximum performance in a few ways, depending on the answers to my questions (and yes, in this specific example it would be irrelevant but this is just an example; we could continue with vectors that (may) do overlapping stores, as long as they load data that hasn't already been reversed).
The while loop could be unrolled, since the throughput of the instructions used is higher than what one iteration issues. If the hardware effectively "unrolls" the loop itself, that would not be necessary.
All of the code following the while loop could be rewritten so that it does not depend on the final startPtr and endPtr values, and thus on the while loop, by computing the number of remaining elements up front and deriving the pointers from that. If the CPU cannot execute other instructions while looping, that would introduce additional overhead; if it can, the remainder handling would finish at roughly the same time as the while loop.
If the code following the while loop does not execute in parallel with it, that code might be better moved to the top of the function, so that the first loop iteration can start executing in parallel with it, since the initial check is cheap to evaluate. The potential of introducing another cache miss does not matter in this case.
As an additional question (since the code following the loop has a branch and another loop): how is the flags register handled in superscalar CPUs? Are there multiple physical copies of it?

CUDA kernel with single branch runs 1.5x faster than kernel without branch

I've got a strange performance inversion on a filter kernel with and without branching. The kernel with branching runs ~1.5x faster than the kernel without branching.
Basically I need to sort a bunch of radiance rays and then apply interaction kernels. Since there is a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.
Idea of the algorithm:
Run a loop over all possible interaction types (there are five)
At every iteration a warp thread votes for its interaction type
After loop completion every warp thread knows about the other threads with the same interaction type
Threads elect their leader (per interaction type)
The leader updates the interaction offsets table using atomicAdd
Each thread writes its data to the corresponding offset
I used techniques described in this Nvidia post https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
My first kernel contains a branch inside the loop and runs for ~5ms:
int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
    if (i == decision)
    {
        active = __ballot(1);
        leader = __ffs(active) - 1;
        warp_progress = __popc(active);
    }
}
My second kernel uses a lookup table of two elements, has no branching, and runs for ~8ms:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    const int masks[2] = { 0, ~0 };
    int mask = masks[i == decision];
    active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
Common part:
int warp_offset;
if (lane_id() == leader)
    warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
How can that be? Is there any way to implement such a filter kernel so that it runs faster than the branching one?
UPD: Complete source code can be found in filter_kernel cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src
I think this is CPU-programmer brain deformation: on a CPU I expect a performance boost from eliminating the branch and the branch misprediction penalty.
But there is no branch prediction on the GPU and hence no penalty, so only the instruction count matters.
First I need to rewrite the code into its simplest form.
With branch:
int active;
for (int i = 0; i != hit_interaction_count; ++i)
    if (i == decision)
        active = __ballot(1);
Without branch:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    int mask = 0 - (i == decision);
    active |= (mask & __ballot(mask));
}
In the first version there are ~3 operations per iteration: the compare, the branch, and __ballot().
In the second version there are ~5 operations per iteration: the compare, building the mask, __ballot(), the & and the |=.
And there are ~15 ops in the common code.
Both loops run for 5 iterations, for a total of roughly 35 ops in the first version and 45 in the second. This calculation can explain the performance degradation.

Unexpected slowdown using omp

I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    for(j = 0; j < 500; j++){
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] subexpression up one level of nesting and saving it in a temporary variable, like this:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    currSet = setList[i];
    for(j = 0; j < 500; j++){
        count = count + currSet.count(val[j]);
    }
}
I had thought this would save a load on each iteration of the "j" loop and give a speedup, but it actually SLOWED DOWN by about 3x; the entire kernel took about three times as long to run. Any thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time, which makes all cores fight for the cache line containing that variable and will considerably impact your performance.
I just noticed that you set the schedule to dynamic. You shouldn't; this workload can be divided up at compile time already, so don't specify a schedule.
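A sketch of the first loop with both of those changes (a reduction on count and the default static schedule); setList, val, and size are taken from the question, and the type of count is an assumption on my part:

// Reduction instead of a shared counter, no schedule clause.
long long count = 0;
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < size; i++) {
    for (int j = 0; j < 500; j++) {
        count += setList[i].count(val[j]);
    }
}

Declaring i and j inside the loops also makes the private clause unnecessary.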
As has already been stated, inter-loop dependencies, i.e. threads waiting for data from other threads or data being accessed by multiple threads in succession, can cause a parallelized program to slow down and should be avoided as a rule of thumb. Built-in constructs like reductions collect the individual results and combine them in an optimized fashion.
Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:
int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: the private clause is only needed for variables declared outside the parallel region; local variables declared inside it are automatically private. In both cases it is not necessary for i to be declared private, since the loop iteration counter of an OpenMP for construct is private by default.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?

OpenCL slow -- not sure why

I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on CPU but I figured the GPU could probably do better.
I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:
for (int i = 0; i < NrOfChannels; i++) {
    for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
        vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528.
On my test file, this function takes approximately 188ms to complete.
Here's my OpenCL method:
kernel void calculateAutocorrelation(size_t offset,
                                     global const float *input,
                                     global float *output,
                                     size_t size) {
    size_t index = get_global_id(0);
    size_t end = size - index;
    float sum = 0.0;
    for (size_t i = 0; i < end; i++)
        sum += input[i + offset] * input[i + offset + index];
    output[index] = sum;
}
This is how it is called:
gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);
for (int i = 0; i < NrOfChannels; i++) {
    size_t sz = PredOrder[ChannelFilter[i]] + 1;
    cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0 }, { 0, 0, 0 } };
    calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
    gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}
According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).
When I use a much larger test file (1 minute of 2-channel music vs. 1 second of 6-channel), the measured performance of the OpenCL code stays about the same, but CPU usage falls to about 50% and the whole program takes about 2x longer to run.
I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.
If I'm reading your kernel code correctly, each work item is iterating over all of the data from its location to the end. This isn't going to be efficient. For one (and the primary performance concern), the memory accesses won't be coalesced and so won't be at full memory bandwidth. Secondly, because each work item has a different amount of work, there will be branch divergence within a work group, which will leave some threads idle waiting for others.
This seems like it has a lot in common with a reduction problem and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.
To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.
Note that if every work item in a work group accesses the same memory, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.

CUDA: different results from CPU

Some questions about CUDA.
1) I noticed that, in every code sample, operations which are not parallel (e.g., the computation of a scalar), when performed in __global__ functions, are always done by specifying a certain thread. For example, in this simple code for a dot product, thread 0 performs the summation:
__global__ void dot( int *a, int *b, int *c )
{
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}
This is fine with me; however, in some code which I wrote I did not specify the thread for the non-parallel operation, and it still works. So, is it compulsory to specify the thread? In particular, the non-parallel operation which I want to perform is the following:
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
The various variables were passed as arguments of the global function. And here comes my second question.
2) I computed the value of V[0] with a CUDA program and with a serial one on the CPU, obtaining different results. Naturally I thought the problem in the CUDA version could be that I did not specify the thread, but even after doing so the result does not change, and it is still (much) greater than the serial one: 6.71201e+22 vs -2908.05. Where could the problem be? The other calculations performed in the __global__ function are the following:
int tid = threadIdx.x;
if ( tid != 0 && tid < N )
{
    {Various stuff which does not involve V or the variables used to compute V[0]}
    V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}
As you can see, my condition avoids the case tid == 0.
3) Finally, a last question: usually in the sample codes I noticed that, if you want to use on the CPU values allocated and computed in GPU memory, you should copy those values to the CPU (e.g., with cudaMemcpy, specifying cudaMemcpyDeviceToHost). But I manage to use those values directly in the main (CPU) code without any problem. Could this be a clue that there is something wrong with my GPU (or my CUDA installation), which also causes the previous odd behaviour?
Thank you for your help.
== Added on the 5th January ==
Sorry for my late reply. Before invoking the kernel, there are all the memory allocations for the arrays to be computed (and there are quite a lot of them). In particular, the code for the array involved in my question is:
float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );
At the end of the code I wrote:
float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(V);
cout << V_[0] << endl;
Thank you again for your attention.
If you don't have any cudaMemcpy in your code, that's exactly the problem. ;-)
The GPU is accessing its own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard.
You need to allocate and copy alpha, kappa, One_g and all other arrays to your GPU first, using cudaMemcpy, then run your kernel and after that copy your results back to the CPU.
Also, don't forget to allocate the memory on BOTH sides.
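Roughly, the flow looks like this (a sketch only: the kernel signature, N, and the host-side array names are assumptions, not your actual code):

// Allocate on the device, copy the inputs in, run the kernel, copy the results out.
float *d_alpha, *d_kappa, *d_One_G, *d_V;
cudaMalloc(&d_alpha, N * sizeof(float));
cudaMalloc(&d_kappa, N * sizeof(float));
cudaMalloc(&d_One_G, N * sizeof(float));
cudaMalloc(&d_V,     N * sizeof(float));

cudaMemcpy(d_alpha, h_alpha, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_kappa, h_kappa, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_One_G, h_One_G, N * sizeof(float), cudaMemcpyHostToDevice);

my_kernel<<<1, N>>>(d_alpha, d_kappa, d_One_G, d_V /*, scalars such as B, Exp, epsilon ... */);

cudaMemcpy(h_V, d_V, N * sizeof(float), cudaMemcpyDeviceToHost); // results back to the CPU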
As for the non-parallel stuff: If the result is always the same, all threads will write the same thing, so the result is exactly the same, just quite a bit more inefficient, since all of them try to access the same resources.
Is that the exact code you're using?
Regarding question 1, you should have a __syncthreads() after the assignment to your shared memory, temp.
Otherwise you'll get a race condition where thread 0 can start the summation prior to temp being fully populated.
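With that fix, the dot kernel from question 1 would look like this (only the __syncthreads() line is new):

__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads(); // make sure every product is in shared memory before summing

    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}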
As for your other question about specifying the thread, if you have
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
Then every thread will execute that code; for example, if you have X number of threads executing, and epsilon is 1 for all of them, then all X threads will evaluate the same line:
V[0] = B*(Exp - 1 - b);
and hence you'll have another race condition, as all X threads will be writing to V[0]. If all the threads compute the same value for B*(Exp - 1 - b), you might not notice a difference, but if they compute different values you're liable to get different results each time, depending on the order in which the threads arrive.
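A minimal way to avoid that race, in line with the dot-product example above, is to let a single thread perform the scalar write:

// Only thread 0 writes the scalar result, so there is no race on V[0].
if (threadIdx.x == 0)
{
    if (epsilon == 1)
        V[0] = B*(Exp - 1 - b);
    else
        V[0] = B*(Exp - 1 + a);
}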
