Fastest way to simulate recurrent equations in OpenCL - performance

I am trying to simulate a recurrent equation of the type
si(t+1) = f[Σj Wijsj(t) + vi*input(t)]
in OpenCL, where f(.) is some non-linear function (in the code below it is just a step-function with threshold th) and s(t) is some external input. Naturally, I implemented one worker for every xi. In every time-step every worker calculates the result of the equation above and subsequently this result is shared with all other workers. Therefore, all workers have to be in the same workgroup.
My current OpenCL kernel looks like this
__kernel void part1(__global int* s, __global float* W, __global float* Th, __global float* V, __global float* input, int N, int T)
{
unsigned int i = get_global_id(0);
float value = 0;
float v = V[i];
float th = Th[i];
for(int t = 0; t < T; t++){
value = v*input[t];
for(int j = 0; j < N; j++){
value = value + W[i*N + j]*s[j];
}
barrier(CLK_GLOBAL_MEM_FENCE);
if (value >= th){
s[i] = 1;
} else {
s[i] = 0;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
Unfortunately, this code is actually three times slower than an equivalent C-implementation. Also, I expected that a change in the number of workers should not make a huge difference (because new workers are sitting on new threads that run in parallel to the others), but actually the processing time increases linearly with the number of workers. The bottleneck seems to be the writing operation after the first barrier. Eliminating this operation (but leaving the barrier in place) cuts down the processing time by a factor of 25 and eliminates the linear dependence.
I am pretty new to OpenCL and I would appreciate any help to speed this code up!
Thanks a lot in advance!
Blue2script

As I already stated in my comment, accessing global memory is slow. Usually the hardware hides the latency by having several subgroups of threads running on the same compute unit. The subgroups I'm referring to are call warps in the NVIDIA lingo and wavefronts in the AMD one. Usually a workgroup is composed of several subgroups.
So meanwhile one subgroup waits to receive the data from the global memory another one that already has all the necessary resources can run. When the running one gets stalled because it needs to read/write data from/to the global memory another one can start running and so on.
However, in your case, because of the barriers, all the workers in all the subgroups will have to that the others wrote in the memory before being able to continue the computation (the barrier is at the workgroup level). Hence the latency hits you right in the face :).
Now, a way to improve your implementation would be to use local memory and this time a barrier at the local memory level (with the flag CLK_LOCAL_MEM_FENCE). The same principle I've just explained still applies but accessing local memory is much faster.
As far as I understand your code (quite likely I didn't get all the subtleties), your s array has N elements, and I guess that you also have N workers. So you create a local array of N elements and you do something like that:
kernel void part1(global int* s, global float* W, global float* Th, global float* V, global float* input, local int* local_s, int N, int T)
{
unsigned int i = get_global_id(0);
unsigned int local_i = get_local_id(0);
float value = 0;
float v = V[i];
float th = Th[i];
//fetch from global to local and sync before computing
local_s[local_i] = s[i];
barrier(CLK_LOCAL_MEM_FENCE);
for(int t = 0; t < T; t++){
value = v*input[t];
for(int j = 0; j < N; j++){
value = value + W[i*N + j]*local_s[j];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (value >= th){
local_s[i] = 1;
} else {
local_s[i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
//If necessary write some stuff to global (maybe the last s computed?)
}
Now I have to warn you:
I might have completely misunderstood your needs :)
I've just edited your code while typing this answer, so there are most probably typos and such.
Even using local memory, having so many barriers might still make the opencl version slower than the C one.
Note that I removed the leading __ since they are not necessary and it is easier to read in my opinion.
EDIT: Regarding your comment about CLK_LOCAL_MEM_FENCE vs. CLK_GLOBAL_MEM_FENCE. A barrier is always applied at the workgroup level, so all workers within a workgroup have to hit that barrier. The flag given as parameter refers to memory access. When the flag is CLK_GLOBAL_MEM_FENCE it means that every read/write operation regarding the global memory has to be completed by every worker before any worker can continue to run the next statements. This is exactly the same with the CLK_LOCAL_MEM_FENCE flag but for the local memory.

Related

How to use CUDA with C to speed up a piece of C code?

This is the device code I have written so far.
__global__ void syndrom(int *d_s, int *d_cx) {
int tid = threadIdx.x + blockDim.x * blockIdx.x + 1;
int t2 = 5460;
int N_BCH = 16383;
if (tid <= t2) {
d_s[Usetid] = 0;
for (int j = 0; j < N_BCH; j ++) {
if (d_cx[j] != 0) {
d_s[tid] ^= d_alpha_to[(tid * j) % N_BCH];
}
}
d_s[tid] = d_index_of[d_s[tid]];
}
}
I call it in the host
dim3 grid(96);
dim3 block(256);
But the speed is not very good, I want to get help. Thanks.
This is not a Complete and Verifiable Example, which you are expected to provide here on StackOverflow (for example - what is d_alpha_to?), but I can still make a few suggestions:
Use more threads instead of having each thread iterate a very large number of times. They way GPU work parallelizes is saturating the processors with threads which are ready to perform more computation.
Don't operate on (the same place in) global memory repeatedly. Put d_s[tid] in a local variable (which will be placed in a register), work on it there, and when you're done, write it back. Accessing global memory is obviously much much slower than accessing registers.
Decorate your pointers with __restrict__ (and make d_cx a const pointer). Read more about __restrict__ here.

Optimize Cuda Kernel time execution

I'm a learning Cuda student, and I would like to optimize the execution time of my kernel function. As a result, I realized a short program computing the difference between two pictures. So I compared the execution time between a classic CPU execution in C, and a GPU execution in Cuda C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
switch(computing_type)
{
case GPU:
HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );
HANDLE_ERROR( cudaEventRecord(start, 0) );
for(int m = 0; m < nb_loops ; m++)
{
diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
}
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
printf("Time to generate: %4.4f ms \n", time/nb_loops);
break;
case CPU:
clock_t begin = clock(), diff;
for (int z=0; z<nb_loops; z++)
{
// Apply the difference between 2 images
for (int i = 0; i < height; i++)
{
tmp = i*imgresult_pitch;
for (int j = 0; j < width; j++)
{
imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
}
}
}
diff = clock() - begin;
float msec = diff*1000/CLOCKS_PER_SEC;
msec = msec/nb_loops;
printf("Time taken %4.4f milliseconds", msec);
break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
int row = blockIdx.x;
int col = threadIdx.x;
int v = col + row*blockDim.x;
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
}
I obtained these execution time for each one
CPU: 1,3210ms
GPU: 0,3229ms
I wonder why GPU result is not as lower as it should be. I am a beginner in Cuda so please be comprehensive if there are some classic errors.
EDIT1:
Thank you for your feedback. I tried to delete the 'if' condition from the kernel but it didn't change deeply my program execution time.
However, after having install Cuda profiler, it told me that my threads weren't running concurrently. I don't understand why I have this kind of message, but it seems true because I only have a 5 or 6 times faster application with GPU than with CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently to all the other ones. If you have an idea of what I am doing wrong, it would be hepful...
Flow.
Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything already has a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time then that; this overhead is negligible" - but you would be ignoring the fact, that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ ​ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
Which subtracts each signed byte value in a from its corresponding one in b. If you can "live" with the result, rather than expecting a larger int variable, that could save you some of work - and go very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
I don't think you are measuring times correctly, memory copy is a time consuming step in GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_H as constants, you may consider doing so using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so block and number of threads per block work better as multiples of 8, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated memory in GPU and destroying your time events created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, beside the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with arrays due to padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are considered not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have modify your code that calls the kernels.
And on a separate note:
GPU has throughput-oriented architecture, but not latency-oriented as CPU. It means CPU may be faster then CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
CUDA Profiler is a very handy tool that will tell you're not optimal in the code.

OpenCL very low GFLOPS, no data transfer bottleneck

I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. I found it to achieve ~20 SP GFLOPS, however according to the GPUs specs I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
float cd;
for (int i=0; i<100000; i++)
{
cd = d*i;
}
if (cd == -1.0f)
{
result[gid] = cd;
}
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOPS = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than the what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).
Here are a couple of tricks:
GPU doesn't like cycles at all. Use #pragma unroll to unwind them.
Your GPU is good at vector operations. Stick to it, that will allow you to process multiple operands at once.
Use vector load/store whether it's possible.
Measure the memory bandwidth - I'm almost sure that you are bandwidth-limited because of poor access pattern.
In my opinion, kernel should look like this:
typedef union floats{
float16 vector;
float array[16];
} floats;
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
floats cd;
cd.vector = vload16(gid, result);
cd.vector *= d;
#pragma unroll
for (int i=0; i<16; i++)
{
if(cd.array[i] == -1.0f){
result[gid] = cd;
}
}
Make your NDRange bigger to compensate difference between 16 & 1000 in loop condition.

Improving an OpenCL kernel for a Perceptron neural network

I've been doing a lot of OpenGL and shaders before, and now, I decided to give a try to OpenCL. I watched some online tutorials, and started reading books on the subject. In order to better understand, and because I believe that the best way to learn is by intelligently trying and learning from the issues that arose while doing so, I decided to start implementing a kernel for a fully-connected perceptron.
For those who don't know what that is, I'll explain the basic idea. It is a neural network in which each neuron of a layer is connected to every neurons of the next layer. Each neuron has but one action to perform: performing the sum of all the neurons from the previous layer, weighted by a different value for each neuron.
This seemed simple enough to implement, and after reading the paper "Parallel Neural Network Training with OpenCL" I implemented it in the following way
Each layer being dependent on the previous one, they're being run sequentially by the host
For computing a layer, I run my kernel with a global work size of the number of neurons within the layer (which can be quite huge, tens of thousand for instance). That makes it so that all the neurons are performing its sum independently to one another.
Each neuron (identified by its global_work_id) performs the weighted sum with all the neurons from the previous layer.
Here is my fully functional opencl kernel:
/**
* #brief Computes one layer of the perceptron given the previous one and the
* weights
* The kernel is run once for each layer.
* The work items are each tasked with computing the output of a single neuron
* of the out layer.
*
* #param out_layer_size
* Size of the output layer (number of elements in the output array that will
* contain the result for each neuron).
* #param in_layer_size
* Number of elements of the input layer
* #param in_value
* Values of the neuron in the previous layer
* #param in_weights
* Array containing the weights for each input neuron. It is organised as a
* two dimensional matrix, written by concatenating each line in the array
* [ w11, w12, w13, ...
* w21, w22, w23, ...
* ..., ..., ..., ...
* ]
* Where wij is the weight linking the neuron i of the input layer to the
* neuron j of the output layer
* #param out_values
* Computed values for the current layer
*/
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private float sum = 0.;
for(int i=0; i < in_layer_s; i++) {
sum += in_weights[i*out_layer_s+global_id] * in_value[i];
}
//out_values[global_id] = sigma(sum);
out_values[global_id] = sum;
}
And here is how I invoke it:
queue.enqueueNDRangeKernel(kernel, cl::NullRange,cl::NDRange(number of neurons within layer),cl::NullRange);
I realize that the bottleneck of this kernel is the implementation of the weighted sum. It would be really helpful if someone could explain how I could improve upon this to make it faster.
I probably don't make proper use of the different memory regions, I'm thinking essentially of the local memory that I don't even use.
Just to give you an idea of performance (that is on an Nvidia GTX 660M), I'll show you some of the times I achieved. Each value is the number of neurons per layer:
2500, 10 000, 2500 : 0.018s ~ 60FPS. It's about 4 to 5 times faster than on my processor (Intel Core i7 running at 2.40GHz)
100 000, 100 000, 500: 140s -> which I guess isn't surpsising since each neuron in the second layer has to perform the weighted sum of 100 000 elements. Running this on my processor yields about the same results.
As you told, bottleneck is the weighted summ. That's not hard to be, as at each layer every WI (Work Item) is doing a lot of IO operations in comparison to number of arithmetic operations. I have no experience in neural networks, but for me problem looks like poor memory access pattern on GPU.
Potentially, that can be solved by organizing your WI into local WGs (Work Groups). As every WI needs to process all data from prev. layer, I guess that all WI in WG can load some amount of data into local memory, process them and than to next bunch of data. This will make your algorithm much more cache friendly. Pseudo-code of kernel looks like:
void kernel Kernel(
__global const int in_layer_size,
__global const int out_layer_size,
__global const float *in_value,
__global const float *in_weights,
__global float *out_values){
__local float buffer[SOME_SIZE];
__global const float* p_in = in_value;
__global float* p_out = out_values;
const int
global_id = get_global_id(0),
local_id = get_local_id(0),
num_buffers = in_layer_size / SOME_SIZE,
offset = out_layer_size * global_id;
float sum = 0.0f;
for(int i=0; i < num_buffers; i++){
buffer[local_id] = p_in[local_id];
barrier(CLK_LOCAL_MEM_FENCE);
//Process all data inside buffer by every WI in WG
//...
p_in += SOME_SIZE;
out_values += SOME_SIZE;
}
//...
return;
}
So, you're sliding with the window of fixed size & calculating data within & then going to next window. Al data operations are done independently, Work Items are only using same data at same time. Optimal size of local group is Device- and Kernel- dependent.
You can do it in many ways.
But the most generic way, without changing how your kernel behaves is to do it is reusing your workgroup size (whatever you selected, or default) and reuse the memory accesses from the group.
I would suggest something like this:
NOTE: I removed thouse ugly pointers for single values. OpenCL supports this, and it is much easier. There is no need to create a memory zone, just do clSetKernelArg(kernel, arg_index, sizeof(cl_float), &size); Where cl_float size = the_size;.
#define IN_LOCAL_SIZE 4096 //Because 16KB/4B (for each float)
void kernel perceptron(global const int in_layer_size, global const int out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
const int global_id = get_global_id(0);
__local float in_buffer[IN_LOCAL_SIZE];
float sum = 0.0f;
event_t ev;
int j;
//For each full buffer
for(j=0; j < (in_layer_size/IN_LOCAL_SIZE)-1; i++) {
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
}
//Last one
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, in_layer_size%IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < in_layer_size%IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
out_values[global_id] = sum;
}
However, if the output size is small (100k, 250k, 500), then you will have just 500 work items, which is not optimal. In that case you should reshape the algorithm.
One possible way to do it is that each workitem works in the inner layer, performing sums, and the whole work group creates one output out of all the work items. That would be easy, since you can control the sums inside the workgroup easily.
But maybe other approaches fit better your problem.
You can make large improvements by caching in_values in local memory. The fewer times you have to read each element of in_values from global memory, the better.
I have come up with a solution that caches the maximum number of input values, and reads each element from global memory only once per work group. This is done by copying a block of in_values at a time, processing it against all out_values, and moving on to the next block. There is also a local array of floats used to reduce the work items' sums of each block.
pseudocode:
output elements assumed to be set to 0 already
for each block of input values:
cache the input block
for each target output value:
reset local sum to 0
for each element this work item is responsible for:
read the weight, multiply, and add to sum
reduce sums to a single value, ADD value to output element
I haven't had a chance to run this through a profiler or debugger yet, but I will give it a try when I am back at my home PC. (no opencl tools at my office workstation). Make sure to queue kernel with group size equal to the GROUP_SIZE constant. Also, only create a single group per compute unit on your device.
real code:
//experiment with GROUP_SIZE to discover the optimal value for your device
//this needs to be equal to local_work_size passed into clEnqueueNDRangeKernel
//use a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//max. for most devices is 256
#define GROUP_SIZE = 64;
// IN_VALUE_CACHE_SIZE is the number of floats from in_value to copy to local memory at a time
//assuming GROUP_SIZE can be up to 256, sizeof(float)=4, and local memory size is 32kb, full saturation can be achieved with the following:
//(32768 - (256 * 4)) /4 = 7936
//try another multiple of 1024 (6144, 4096... )if there is trouble with this value
#define IN_VALUE_CACHE_SIZE = 7936;
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
local float result_buffer[GROUP_SIZE];
local float in_value_cache[IN_VALUE_CACHE_SIZE];
int i,j,k;
//init the block to 0, in case there are fewer than IN_VALUE_CACHE_SIZE values in total
for(i=item_id; i<IN_VALUE_CACHE_SIZE; i+= GROUP_SIZE){
in_value_cache[i] = 0.0;
}
barrier(CL_LOCAL_MEM_FENCE);
private float sum = 0.0;
event_t e;
int copy_total = 0;
int copy_offset;
for(i=0; i<in_layer_s; i+=IN_VALUE_CACHE_SIZE){
//cap the number of values to copy to local memory if loop is near the end of the input data
copy_total = IN_VALUE_CACHE_SIZE;
if((copy_total + i*IN_VALUE_CACHE_SIZE) > in_layer_s){
copy_total = in_layer_s - i*IN_VALUE_CACHE_SIZE;
}
//copy the next block of values
e = async_work_group_copy(in_value_cache, in_value + i * 4, copy_total, 0);
wait_group_events(1, &e);
for(j=group_id; j<out_layer_s; j+=group_count){
sum = 0.0;
//need to reset result_buffer[item_id] as well
//this is in case there are fewer than GROUP_SIZE input values remaining ie copy_total < GROUP_SIZE
result_buffer[item_id] = 0.0;
for(k=item_id; k<copy_total; k+=GROUP_SIZE){
sum += in_value_cache[k] * in_weights[(k+i) + j * out_layer_s];
}
result_buffer[item_id] = sum;
//simple O(n) reduction can be optimized further
if(item_id == 0){
for(k=1;k<GROUP_SIZE;k++){
sum += result_buffer[k];
}
out_values[j] += sum;
}
barrier(CL_LOCAL_MEM_FENCE);
}
}
}
This will handle input of any size, so you can try it with as many elements as you have global memory for.

CUDA: different results from CPU

Some questions about CUDA.
1) I noticed that, in every sample code, operations which are not parallel (i.e., the computation of a scalar), performed in global functions, are always done specifying a certain thread. For example, in this simple code for a dot product, thread 0 performs the summation:
__global__ void dot( int *a, int *b, int *c )
{
// Shared memory for results of multiplication
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
// Thread 0 sums the pairwise products
if( 0 == threadIdx.x )
{
int sum = 0;
for( int i = 0; i < N; i++ )
sum += temp[i];
*c = sum;
}
}
This is fine for me; however, in a code which I wrote I did not specify the thread for the non-parallel operation, and it still works: hence, is it compulsory to define the thread? In particular, the non-parallel operation which I want to perform is the following:
if (epsilon == 1)
{
V[0] = B*(Exp - 1 - b);
}
else
{
V[0] = B*(Exp - 1 + a);
}
The various variables were passed as arguments of the global function. And here comes my second question.
2) I computed the value of V[0] with a program in CUDA and another serial on the CPU, obtaining different results. Obviously I thought that the problem in CUDA could be that I did not specify the thread, but, even with this, the result does not change, and it is still (much) greater from the serial one: 6.71201e+22 vs -2908.05. Where could be the problem? The other calculations performed in the global function are the following:
int tid = threadIdx.x;
if ( tid != 0 && tid < N )
{
{Various stuff which does not involve V or the variables used to compute V[0]}
V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}
As you can see, in my condition I avoid to consider the case tid == 0.
3) Finally, a last question: usually in the sample codes I noticed that, if you want to use on the CPU values allocated and computed on the GPU memory, you should copy those values on the CPU (e.g, with command cudaMemcpy, specifying cudaMemcpyDeviceToHost). But I manage to use those values directly in the main code (CPU) without any problem. Can be this a clue that there is something wrong with my GPU (or my installation of CUDA), which also causes the previous odd things?
Thank you for your help.
== Added on the 5th January ==
Sorry for the late of my reply. Before invoking the kernel, there are all the memory allocations of the arrays to compute (which are quite a lot). In particular, the code about the array involved in my question is:
float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );
At the end of the code I wrote:
float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(V);
cout << V_[0] << endl;
Thank you again for your attention.
if you don't have any cudaMemcpy in your code, that's exactly the problem. ;-)
The GPU is accessing it's own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard.
You need to allocate and copy alpha, kappa, One_g and all other arrays to your GPU first, using cudaMemcpy, then run your kernel and after that copy your results back to the CPU.
Also, don't forget to allocate the memory on BOTH sides.
As for the non-parallel stuff: If the result is always the same, all threads will write the same thing, so the result is exactly the same, just quite a bit more inefficient, since all of them try to access the same resources.
Is that the exact code you're using?
In regards to question 1, you should have a __syncthreads() after the assignment to your shared memory, temp.
Otherwise you'll get a race condition where thread 0 can start the summation prior to temp being fully populated.
As for your other question about specifying the thread, if you have
if (epsilon == 1)
{
V[0] = B*(Exp - 1 - b);
}
else
{
V[0] = B*(Exp - 1 + a);
}
Then every thread will execute that code; for example, if you have X number of threads executing, and epsilon is 1 for all of them, then all X threads will evaluate the same line:
V[0] = B*(Exp - 1 - b);
and hence you'll have another race condition, as you'll have all X threads writing to V[0]. If all the threads have the same value for B*(Exp - 1 - b), then you might not notice a difference, while if they have different values then you're liable to get different results each time, depending on what order the threads arrive

Resources