I wrote a very naive implementation of the mergesort algorithm and ported it to CUDA with minimal changes. The code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
int i,j,k;
for(i=m+1;i>l;i--){
aux[i-1]=data[i-1];
}
//Copy in reverse order the second subarray
for(j=m;j<r;j++){
aux[r+m-j]=data[j+1];
}
//Merge
for(k=l;k<=r;k++){
if(aux[j]<aux[i] || i==(m+1))
data[k]=aux[j--];
else
data[k]=aux[i++];
}
}
//Each thread performs a local merge sort
//on its own slice of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
{
merge(aux,data,j-i,j-1,j+i-1);
}
__syncthreads();
if(i==0){
//Complete the merge
do{
for(int i{tn};i<(n+1);i+=2*tn)
merge(aux,data,i-tn,i-1,i+tn-1);
tn*=2;
}while(tn<(n/2)+1);
}
}
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much worse than the same code running on the CPU with 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements takes a few hundred milliseconds on the CPU, but up to 10 minutes on the GPU (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please give me some comments? I strongly suspect that the problem is in the final merge operation performed by the first thread. At that point we have a certain number of sorted subarrays (the exact number depends on the number of threads) that need to be merged, and this is done by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that each thread performs further merges in parallel and the "Complete the merge" step only merges the last two sorted subarrays.
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link. I must admit I still need some time to learn CUDA better before I can take full advantage of that material. Anyway, I was able to rewrite the sorting function so that it uses multiple threads for as long as possible. My first implementation had a bottleneck in the last phase of the merge procedure, which was performed by a single thread.
Now, after the first merge, I use up to (1/2)*(n/b) threads each time, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads it takes about 10 seconds to sort 30 million elements. Unfortunately, this is still a poor result! The problem is thread synchronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
__syncthreads();
for(int m{1};m<=(r-l);m=2*m)
{
for(int i{l};i<=r;i+=2*m)
{
merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
}
}
__syncthreads();
do{
if(k<=(n/b)*.5)
{
l=2*k*b;
r=min(l+2*b-1,n-1);
merge(aux,data,l,min(r,l+b-1),r);
}else break;
__syncthreads();
b*=2;
}while((r+1)<n);
}
The 'merge' function is the same as before. The problem now is that I'm using only 1024 threads instead of the 65000 or more that my CUDA device can run, because __syncthreads does not work as a synchronization primitive at grid level, only at block level!
So I can synchronize at most 1024 threads, the maximum number of threads per block. Without proper synchronization the threads corrupt each other's data, and the merging procedure does not work.
To boost the performance I need some kind of synchronization between all the threads in the grid. It seems that no API exists for this purpose, and I read about a solution that involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
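A minimal CPU-side sketch of that host-as-barrier idea, in plain C++ so it can be checked without a GPU. The names merge_pass and sort_by_passes are hypothetical, and std::merge stands in for the device 'merge' function. Each call to merge_pass corresponds to one kernel launch, and each value of t inside it corresponds to one independent GPU thread; returning from the call is the grid-wide barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One pass: merge adjacent sorted runs of length `width` from src into dst.
// Every value of t touches a disjoint range, so on a GPU each t could be one thread.
void merge_pass(const std::vector<int>& src, std::vector<int>& dst, std::size_t width) {
    assert(width > 0 && dst.size() == src.size());
    const std::size_t n = src.size();
    for (std::size_t t = 0; t * 2 * width < n; ++t) {
        const std::size_t lo  = t * 2 * width;
        const std::size_t mid = std::min(lo + width, n);
        const std::size_t hi  = std::min(lo + 2 * width, n);
        std::merge(src.begin() + lo, src.begin() + mid,
                   src.begin() + mid, src.begin() + hi,
                   dst.begin() + lo);
    }
}

// Host driver: one "launch" per pass, with the run width doubling each time.
// Returning from merge_pass plays the role of the barrier between passes.
void sort_by_passes(std::vector<int>& data) {
    std::vector<int> aux(data.size());
    for (std::size_t width = 1; width < data.size(); width *= 2) {
        merge_pass(data, aux, width);
        data.swap(aux);  // the output of this pass is the input of the next
    }
}
```

The number of independent merges halves with every pass, so the last passes still use few threads; this sketch only removes the need for a grid-level barrier inside the kernel.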
I have a plan for how to implement this technique in my mergesort function, and I will post the code in the near future. Do you have any suggestions of your own?
Thanks
It looks like all the work is being done in global memory. Every read and every write there is slow, which makes the whole function slow. I think it would help to first copy your data into __shared__ memory, do the work there, and then, once the sorting is complete for that block, copy the results back to global memory.
A global memory access costs on the order of 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory, on the other hand, is on-chip and much faster: on the order of tens of cycles or less, depending on the architecture.
The above would help with performance a lot. Some other very minor things you can try are:
(1) Remove the first __syncthreads(); it is not really doing anything because no data is being passed between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. Every thread recomputes it, when it really only needs to be calculated once.
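As a sketch of suggestion (2), the chunk size can be computed once on the host with integer arithmetic. The function name chunk_size is hypothetical. It assumes the intent of the kernel expression, which (after truncation to int) yields the smallest power of two strictly greater than ceil(n / total_threads).

```cpp
#include <cassert>

// Smallest power of two strictly greater than ceil(n / total_threads).
// This matches what the log2/pow expression in the kernel evaluates to,
// without going through floating point.
int chunk_size(int n, int total_threads) {
    assert(n > 0 && total_threads > 0);
    const int per_thread = (n + total_threads - 1) / total_threads;  // ceil(n / total_threads)
    int b = 1;
    while (b <= per_thread) b *= 2;
    return b;
}
```

The result would then be passed to the kernel as an extra argument, with total_threads equal to the number of blocks times the number of threads per block used in the launch.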
I tried to follow along on your algorithm but was not able to. The variable names were hard to follow...or... your code is above my head and I cannot follow. =) Hope the above helps.
Related
I've been a good boy and parallelized my compute shader to execute 955 threads for 20 iterations
[numthreads(955, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 20; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
But this isn't going to work out (because the parallelization introduces a realtime delay) so I have to do it a less parallel way. The easy way to approach the problem is to have 20 threads doing 955 iterations each
[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 955; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
However, I can't reason about how this is going to perform (probably terribly).
Under this new approach I must keep the number of iterations the same, but I can trade off how often I call the compute shader against the number of threads. This gives me two options:
Increase 20 -> 32 to have a full warp.
Increase 20 -> 32 * n to have warps running in parallel.
Maybe accessing groupshared memory is very cheap and so I don't have a performance problem in the first place.
Maybe I should try to optimize this on the cpu (I've already tried unoptimized and the performance was less than desired).
Someone commented on this answer
To be specific, a single-thread group will generally cap utilization to around 3-6%. Dispatching only one group compounds the issue, capping utilization to well under 1%. Sticking to 256 threads with power-of-two dimension sizes is a good rule of thumb, and you should dispatch at least 2048 or so threads total to keep the hardware busy.
and I decided that doing this work on the GPU is a stupid thing to do. It's always best to look for robust solutions.
The robust solution for my problem is to use SIMD, which I will now have to learn the hard way.
I have two vectors and I want to calculate their dot product in parallel. I was able to do the multiplication of each element in parallel and, after that, the addition in parallel. The code I have tried is below.
But I want to do both the multiplication and the addition in parallel. That is, elements that have already been multiplied should be added even if other elements haven't been multiplied yet. I hope that makes sense.
#include<stdio.h>
#include<stdlib.h> //rand
#include<cuda.h>
__global__ void dotproduct(int *a,int *b,int *c,int N)
{
int k=N;
int i=threadIdx.x;
c[i]=a[i]*b[i];
if(N%2==1)
N=N+1;
__syncthreads();
while(i<(N/2))
{
if((i+1)*2<=k)
{
c[i]=c[i*2]+c[i*2+1];
}
else
{
c[i]=c[k-1];
}
k=N/2;
N=N/2;
if(N%2==1&&N!=1)
N=N+1;
__syncthreads(); //wait for all the threads to synchronize
}
}
int main()
{
const int N=10; //vector size (const so the arrays below have a fixed size)
int a[N],b[N],c[N];
int *dev_a,*dev_b,*dev_c;
cudaMalloc((void**)&dev_a,N*sizeof(int));
cudaMalloc((void**)&dev_b,N*sizeof(int));
cudaMalloc((void**)&dev_c,N*sizeof(int));
for(int i=0;i<N;i++)
{
a[i]=rand()%10;
b[i]=rand()%10;
}
cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice);
dotproduct<<<1,N>>>(dev_a,dev_b,dev_c,N);
cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);
for(int i=0;i<N;i++)
{
printf("%d,%d\n",a[i],b[i]);
}
printf("the answer is : %d in GPU\n",c[0]);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
cudaDeviceReset(); //replaces the deprecated cudaThreadExit()
return 0;
}
I don't think it makes sense to do the multiplication and addition in parallel: all multiplications take the same time, and trying to run different instructions at the same time can reduce performance. But the part in which you sum the multiplication results can be optimized.
You may need to use atomics or shuffle instructions - read this for a good explanation: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
And if it's not an exercise but a real task, I suggest you use cuBLAS, which has this functionality built in: https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-dot
Take your favorite reduction (= adding-up of the values of a vector), and modify it so that instead of every read from the single input vector, you perform two reads, with the same index, from two input vectors, and multiply the results.
This will maintain as much parallelism as you were able to effect using the reduction kernel; and if your reads were memory-coalesced before, they will also be coalesced now. Your throughput in terms of output elements per time unit should be almost exactly half what it was before, for very long vectors.
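A CPU-side C++ sketch of that idea, under the assumption of a simple tree-shaped sum reduction (the name dot_by_reduction is hypothetical). The only difference from a plain sum reduction is the load step, which reads the same index from both inputs and multiplies. On a GPU each iteration of the inner loops would be one thread, with a barrier between passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int dot_by_reduction(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    if (n == 0) return 0;
    std::vector<int> partial(n);
    // Load step: two reads with the same index, one multiply.
    for (std::size_t i = 0; i < n; ++i)
        partial[i] = a[i] * b[i];
    // Reduction: each pass folds the upper half onto the lower half.
    // Writes go to indices below `half` and reads come from indices at or
    // above it, so the iterations of one pass do not interfere.
    for (std::size_t active = n; active > 1; ) {
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i + half < active; ++i)
            partial[i] += partial[i + half];
        active = half;
    }
    return partial[0];
}
```

Because the reads and writes of a pass never touch the same element, this also works for lengths that are not a power of two.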
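I think there is also a special PTX ISA instruction that computes a small dot product in one go ( dp4a.atype.btype d, a, b, c; ). Note that it works on four packed 8-bit integers per 32-bit operand, accumulates into a 32-bit integer, and requires a device of compute capability 6.1 or higher, so it only applies if your data fits in bytes. With a little effort you can write a small inline PTX assembly function. Check the documentation.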
Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector index, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each vector in the result is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is also a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads, or how should I think about assigning threads, for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
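The offset pattern can be checked on the CPU with a small C++ sketch. The function name cross_component is hypothetical, and idx plays the role of one thread's global index into the interleaved [x0, y0, z0, x1, y1, z1, ...] arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes the single output component that a thread with global index idx
// would be responsible for.
float cross_component(const std::vector<float>& v1,
                      const std::vector<float>& v2,
                      std::size_t idx) {
    assert(v1.size() == v2.size() && v1.size() % 3 == 0 && idx < v1.size());
    const std::size_t base = idx - idx % 3;      // index of this vector's x component
    const std::size_t off1 = (idx % 3 + 1) % 3;  // +1 (modulo 3): x->y, y->z, z->x
    const std::size_t off2 = (idx % 3 + 2) % 3;  // +2 (modulo 3): x->z, y->x, z->y
    return v1[base + off1] * v2[base + off2] - v1[base + off2] * v2[base + off1];
}
```

For idx pointing at an x component this evaluates vec1[y]*vec2[z] - vec1[z]*vec2[y], which matches the first line of the formula in the question; the y and z cases follow in the same way.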
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
__shared__ T sv1[blksize];
__shared__ T sv2[blksize];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
while (idx < 3*n){ // grid-stride loop
// load shared memory using coalesced pattern to global memory
sv1[threadIdx.x] = vec1[idx];
sv2[threadIdx.x] = vec2[idx];
// compute modulo/offset indexing for thread loads of shared data from vec1, vec2
int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
int off1 = my_mod+1;
if (off1 > 2) off1 -= 3;
int off2 = my_mod+2;
if (off2 > 2) off2 -= 3;
__syncthreads();
// each thread loads its computation elements from shared memory
T t1 = sv1[threadIdx.x-my_mod+off1];
T t2 = sv2[threadIdx.x-my_mod+off2];
T t3 = sv1[threadIdx.x-my_mod+off2];
T t4 = sv2[threadIdx.x-my_mod+off1];
// compute result, and store using coalesced pattern, to global memory
res[idx] = t1*t2-t3*t4;
__syncthreads(); // all reads of shared memory must finish before the next iteration overwrites it
idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
h_res = (mytype *)malloc(N*3*sizeof(mytype));
cudaMalloc(&d_v1, N*3*sizeof(mytype));
cudaMalloc(&d_v2, N*3*sizeof(mytype));
cudaMalloc(&d_res, N*3*sizeof(mytype));
for (int i = 0; i<N; i++){
h_v1[3*i] = TV1;
h_v1[3*i+1] = 0;
h_v1[3*i+2] = 0;
h_v2[3*i] = 0;
h_v2[3*i+1] = TV2;
h_v2[3*i+2] = 0;
h_res[3*i] = 0;
h_res[3*i+1] = 0;
h_res[3*i+2] = 0;}
cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
// verification
for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.
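The arithmetic above can be reproduced with a couple of lines of C++. The helper name achieved_gb_per_s is hypothetical, and 1 GB is taken as 1e9 bytes, which is the convention that makes the bandwidthTest figure of 171220.3 MB/s read as about 171 GB/s.

```cpp
#include <cassert>

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that moved
// bytes_moved bytes in `seconds` seconds.
double achieved_gb_per_s(double bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}
```

With bytes_moved = 40960*3*4*3 = 1474560, a run time of 9.824e-6 s gives about 150.1 GB/s, or about 88% of 171.2 GB/s. A run time of 8.99e-6 s gives about 164.0 GB/s, or about 96%.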
I'm a beginner at cuda and am having some difficulties with it
If I have an input vector A and a result vector B, both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loop simultaneously.
edit: Have a device with cc 2.0
example:
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
double ai = a[i];
for(j=0; j<1000; j++) {
double aj = a[j];
if (i == j)
continue;
result += ai - aj;
}
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
kernelFunc <<<2, 500>>> (i, d_a);
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const int length){
unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < length){
double my_a = a[idx];
double result = 0.0;
for (int j=0; j<length; j++)
result += my_a - a[j];
b[idx] = result;
}
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we subtract a[i] from itself: for any finite value that difference is exactly zero, so it does not disturb the result. If you're concerned about that, you can special-case it out easily enough.
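A CPU-side C++ mirror of the kernel makes the special case easy to compare. The function name all_diffs and the skip_self flag are hypothetical additions for illustration; the outer loop index idx stands in for one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fills b[idx] with the sum over j of (a[idx] - a[j]).
// With skip_self set, the j == idx term is left out explicitly.
void all_diffs(const std::vector<double>& a, std::vector<double>& b, bool skip_self) {
    assert(b.size() == a.size());
    for (std::size_t idx = 0; idx < a.size(); ++idx) {  // one "thread" per idx
        const double my_a = a[idx];
        double result = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j) {
            if (skip_self && j == idx) continue;  // explicit special case
            result += my_a - a[j];
        }
        b[idx] = result;
    }
}
```

For finite inputs both settings of skip_self produce the same values, since the skipped term only ever adds zero. The two differ if the data can contain infinities or NaNs, because then a[idx] - a[idx] is NaN.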
You'll also get more performance if you increase the number of blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However, I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.
I'm having trouble with the simple task of finding the maximum of an array in OpenCL.
__kernel void ndft(/* lots of stuff*/)
{
size_t thread_id = get_global_id(0); // thread_id = [0 .. spectrum_size[
/* MATH MAGIC */
// Now I have float spectrum_abs[spectrum_size] and
// I want the maximum as well as the index holding the maximum
barrier();
// this is the old, sequential code:
if (*current_max_value < spectrum_abs[i])
{
*current_max_value = spectrum_abs[i];
*current_max_freq = i;
}
}
Now I could add if (thread_id == 0) and loop through the entire thing as I would do on a single core system, but since performance is a critical issue (otherwise I wouldn't be doing spectrum calculations on a GPU), is there a faster way to do that?
Returning to the CPU at the end of the kernel above is not an option, because the kernel actually continues after that.
You will need to write a parallel reduction. Split your "large" array into small pieces (a size a single workgroup can effectively process) and compute the min-max in each.
Do this iteratively (involves both host and device code) till you are left with only one set of min/max values.
Note that you might need to write a separate kernel that does this unless the current work-distribution works for this piece of the kernel (see my question to you above).
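An alternative, if your current work distribution is amenable, is to find the min/max inside each workgroup and write it to a buffer in global memory (index = group_id, one slot per workgroup). After a barrier(), simply have the work-item with thread_id == 0 loop over the reduced results and find the max among them. Keep in mind that barrier() only synchronizes work-items within one workgroup, so this works as described only if the partial results come from a single workgroup or the final loop runs in a separate kernel. This will not be the optimal solution, but it might be one that fits inside your current kernel.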
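The two-stage structure can be sketched on the CPU in C++. The name argmax_two_stage and the group_size parameter are hypothetical; each iteration of the outer loop in stage 1 stands in for one workgroup, and the partial vector stands in for the global-memory buffer of per-group results.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (maximum value, index of its first occurrence).
std::pair<float, std::size_t> argmax_two_stage(const std::vector<float>& v,
                                               std::size_t group_size) {
    assert(!v.empty() && group_size > 0);
    // Stage 1: every group reduces its own slice. The groups are independent,
    // so on a device they could all run at the same time.
    std::vector<std::pair<float, std::size_t>> partial;
    for (std::size_t start = 0; start < v.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, v.size());
        std::size_t best = start;
        for (std::size_t i = start + 1; i < end; ++i)
            if (v[i] > v[best]) best = i;
        partial.emplace_back(v[best], best);
    }
    // Stage 2: one pass over the much shorter list of per-group candidates.
    std::pair<float, std::size_t> result = partial[0];
    for (const auto& p : partial)
        if (p.first > result.first) result = p;
    return result;
}
```

Carrying the index alongside the value through both stages is what lets the final result report where the maximum was found, which the question asks for.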