Perform both multiplication and addition in parallel using CUDA

I have two vectors and I want to calculate their dot product in parallel. I was able to do the multiplication of each element in parallel and, after that, the addition in parallel; the code I have tried follows.
But I want to do both multiplication and addition in parallel. That is, elements that have finished their multiplication should be added even if other elements haven't been multiplied yet. I hope you understand what I mean.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void dotproduct(int *a, int *b, int *c, int N)
{
    int k = N;
    int i = threadIdx.x;
    c[i] = a[i] * b[i];
    if (N % 2 == 1)
        N = N + 1;
    __syncthreads();
    while (i < (N / 2))
    {
        if ((i + 1) * 2 <= k)
        {
            c[i] = c[i * 2] + c[i * 2 + 1];
        }
        else
        {
            c[i] = c[k - 1];
        }
        k = N / 2;
        N = N / 2;
        if (N % 2 == 1 && N != 1)
            N = N + 1;
        __syncthreads(); // wait for all the threads to synchronize
    }
}

int main()
{
    int N = 10; // vector size
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));
    for (int i = 0; i < N; i++)
    {
        a[i] = rand() % 10;
        b[i] = rand() % 10;
    }
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    dotproduct<<<1, N>>>(dev_a, dev_b, dev_c, N);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
    {
        printf("%d,%d\n", a[i], b[i]);
    }
    printf("the answer is : %d in GPU\n", c[0]);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaThreadExit(); // deprecated; cudaDeviceReset() is the modern equivalent
    return 0;
}

I don't think it makes sense to do the multiplication and the addition in parallel with each other - all multiplications take roughly the same time, and trying to run different instructions at the same time can reduce performance. But the part in which you sum the multiplication results can be optimized.
You may need to use atomics or shuffle instructions - read this for a good explanation: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
And if it's not an exercise but a real task, I suggest you use cuBLAS; it has this functionality built in: https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-dot
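For illustration, a minimal host-side sketch of the cuBLAS route could look like the following (single-precision data assumed, error checking omitted, and the wrapper name dot_with_cublas is made up for this example):

#include <cublas_v2.h>

float dot_with_cublas(const float *dev_x, const float *dev_y, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    float result = 0.0f;
    // cublasSdot reads the two device vectors (stride 1) and writes the
    // scalar result back to host memory.
    cublasSdot(handle, n, dev_x, 1, dev_y, 1, &result);
    cublasDestroy(handle);
    return result;
}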

Take your favorite reduction (= adding-up of the values of a vector), and modify it so that instead of every read from the single input vector, you perform two reads, with the same index, from two input vectors, and multiply the results.
This will maintain as much parallelism as you were able to effect using the reduction kernel; and if your reads were memory-coalesced before, they will also be coalesced now. Your throughput in terms of output elements per time unit should be almost exactly half what it was before, for very long vectors.
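As an illustration of that change (a sketch, not the original poster's code), here is a single-block version that assumes BLOCK is a power of two and N <= BLOCK; out-of-range threads contribute zero:

#define BLOCK 256

__global__ void dotproduct(const int *a, const int *b, int *result, int N)
{
    __shared__ int partial[BLOCK];
    int i = threadIdx.x;

    // The "two reads plus a multiply" step replaces the single read of a
    // plain sum reduction.
    partial[i] = (i < N) ? a[i] * b[i] : 0;
    __syncthreads();

    // Standard shared-memory tree reduction over the partial products.
    for (int s = BLOCK / 2; s > 0; s >>= 1)
    {
        if (i < s)
            partial[i] += partial[i + s];
        __syncthreads();
    }
    if (i == 0)
        *result = partial[0];
}

It would be launched as dotproduct<<<1, BLOCK>>>(dev_a, dev_b, dev_result, N); longer vectors need one partial result per block plus a second reduction pass over those partials.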

I think that there is also a special PTX ISA instruction, dp4a.atype.btype d, a, b, c;, which computes a four-element (packed 8-bit) dot product with accumulation in a single instruction. With little effort you can write a small inline PTX assembly wrapper around it. Check the documentation.
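A rough, untested sketch of such a wrapper follows; note that dp4a requires compute capability 6.1 or newer, treats each 32-bit operand as four packed 8-bit values, and newer CUDA toolkits also expose the same operation as the __dp4a intrinsic:

// Adds the four-element dot product of the packed bytes in a and b to c.
__device__ __forceinline__ int dp4a_s32(int a, int b, int c)
{
    int d;
    asm("dp4a.s32.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}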

Related

Vulkan compute shader for parallel sum reduction

I want to implement this algorithm https://dournac.org/info/gpu_sum_reduction in Vulkan's compute shader. In OpenCL it would be easy because I can explicitly declare which buffers are __local and which are __global. Unfortunately, I can't seem to find any such mechanism in Vulkan. Could somebody more experienced show me an example of how to get this working in Vulkan, please?
Subgroups in Vulkan seem to be the equivalent functionality: they let the shader invocations within a subgroup cooperate.
Maybe something like:
void main() {
    int partial_sum = subgroupAdd(arr[gl_GlobalInvocationID.x]);
    if (subgroupElect()) {
        atomicAdd(mem, partial_sum);
    }
}
You could study the Subgroup Tutorial.
Then again you could just try going about it the "normal" way by simply:
void main() {
    int partial_sum = 0;
    for (int i = 0; i < LOCAL_SIZE; ++i) {
        partial_sum += arr[gl_WorkGroupID.x * gl_WorkGroupSize.x + i];
    }
    atomicAdd(mem, partial_sum); // or perhaps without atomics, by recursively reducing
}
It is not so much "parallel", but then again it needs no barriers. It is just a matter of measuring performance to find what works best, and it may also depend on how large you expect the input arrays to be.
Disclaimer: I have not tried the shaders, so treat them as pseudocode that may contain bugs.
It should also be possible to implement your linked algorithm nearly verbatim. The equivalent of __local in compute GLSL is shared. The workgroup execution barrier in GLSL is barrier() (with memoryBarrierShared() for shared-memory visibility).

CSR x CSR Matrix Multiplication for Finding Out Cycles

I am trying to find the number of cycles of a specified length (k) that contain vertex u, for each vertex u in an undirected graph. To do so I am trying to compute the k-th power of the adjacency matrix. I created the CSR representation of the graph from the edge list, and that part is really fast. But the CSR x CSR multiplication part is really slow (it seems to take 50 min for a 500k x 500k matrix). I am curious about a better solution. Is there a more efficient approach, given that this is an adjacency matrix? Or is there a better CSR x CSR matrix multiplication algorithm I could look at? I could not find any CSR x CSR matrix multiplication example, either as an algorithm or as a C++ implementation.
void multiply_matrix(std::vector<int> &adj, std::vector<int> &xadj, std::vector<int> &values,
                     std::vector<int> &adj2, std::vector<int> &xadj2, std::vector<int> &values2, int size)
{
    std::vector<int> result_adj;
    std::vector<int> result_xadj(size + 1, 0);
    std::vector<int> result_value(values.size(), 0);
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            int result = 0;
            int startIndex = xadj[i];
            int endIndex = xadj[i + 1];
            for (int index = startIndex; index < endIndex; index++)
            {
                int currentValRow = values[adj[index]];
                bool shouldContinue = false;
                for (int colIndex = xadj2[j]; colIndex < xadj2[j + 1]; colIndex++)
                {
                    if (adj[index] == adj2[colIndex])
                    {
                        shouldContinue = true;
                        break;
                    }
                }
                if (!shouldContinue)
                    continue;
                int currentValCol = values2[adj2[index]];
                result += currentValCol * currentValRow;
            }
            if (result != 0)
            {
                result_xadj[i + 1]++;
                if (i + 2 < result_xadj.size())
                    result_xadj[i + 2] = result_xadj[i + 1];
                result_adj.push_back(j);
                result_value[j] = result;
            }
        }
    }
}
I solved my problem and wanted to share it with those who also lack the required "terminology" to find the many resources on this topic. When you google "sparse matrix multiplication" it is hard to find sparse matrix x sparse matrix, which is called SpGEMM. There are lots of informative papers about the process.
The pseudocode of the algorithm I used: [figure: general SpGEMM algorithm]
I modified the algorithm a little to produce CSR output. The challenge with that seems to be allocating the result arrays that hold the CSR data (values, index array, etc.). There are different methods used to solve that issue, such as:
1. Allocating the arrays as big as the upper bound, which may be a problem if your matrices are too big. If you decide to go this way you can look into: https://math.stackexchange.com/questions/1042096/bounds-of-sparse-matrix-multiplication.
2. Doing the multiplication once, before allocating any memory for the result, just to determine the number of non-zeros in the result. Since no writes to the result arrays happen in this pass, it runs really fast, and the memory required for the result arrays can be allocated after this "dummy run".
3. Allocating a pre-determined amount and, when it turns out not to be sufficient, allocating a new, bigger array and copying the contents over.
I implemented the function for both the CPU (using OpenMP) and the GPU (using CUDA). In the OpenMP approach I used a method similar to option 3 above: I used separate vectors for the results of each row and then concatenated them. The vector approach may be slower than doing the re-allocation manually, but it was easier, so I chose that way, and it is fast enough (for the 500k x 500k test matrix the multiplication takes around 1.3 seconds using 60 threads on my test machine). For the GPU approach I used option 2: first I calculate the required amount of memory, then the actual multiplication happens.
Edit: Also note that this method counts "walks" rather than paths, so there might be repeated vertices.
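To make option 2 concrete, here is a hedged sketch of the symbolic-then-numeric two-pass idea in plain C++ (illustrative only, not the implementation described above; the struct and function names are made up, and the CSR arrays follow the same naming convention as in the question):

#include <algorithm>
#include <vector>

struct Csr {
    std::vector<int> xadj;   // row pointers, size rows+1
    std::vector<int> adj;    // column indices
    std::vector<int> values; // nonzero values
};

Csr spgemm(const Csr &A, const Csr &B, int rows, int cols)
{
    Csr C;
    C.xadj.assign(rows + 1, 0);
    std::vector<int> marker(cols, -1); // last row that touched each column

    // Pass 1 (symbolic): count the nonzeros of each row of C = A * B.
    for (int i = 0; i < rows; ++i)
        for (int p = A.xadj[i]; p < A.xadj[i + 1]; ++p)
            for (int q = B.xadj[A.adj[p]]; q < B.xadj[A.adj[p] + 1]; ++q) {
                int j = B.adj[q];
                if (marker[j] != i) { marker[j] = i; ++C.xadj[i + 1]; }
            }
    for (int i = 0; i < rows; ++i)
        C.xadj[i + 1] += C.xadj[i]; // prefix sum -> row pointers

    // Allocate the result exactly once, now that its size is known.
    C.adj.resize(C.xadj[rows]);
    C.values.resize(C.xadj[rows]);

    // Pass 2 (numeric): accumulate each row with a dense scratch array.
    std::vector<int> accum(cols, 0);
    std::fill(marker.begin(), marker.end(), -1);
    for (int i = 0; i < rows; ++i) {
        int nnz = C.xadj[i];
        for (int p = A.xadj[i]; p < A.xadj[i + 1]; ++p) {
            int k = A.adj[p];
            int a = A.values[p];
            for (int q = B.xadj[k]; q < B.xadj[k + 1]; ++q) {
                int j = B.adj[q];
                if (marker[j] != i) { marker[j] = i; C.adj[nnz++] = j; accum[j] = 0; }
                accum[j] += a * B.values[q];
            }
        }
        for (int p = C.xadj[i]; p < C.xadj[i + 1]; ++p)
            C.values[p] = accum[C.adj[p]];
    }
    return C;
}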

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = gs / 2; s > 0; s >>= 1) {
        if (lid < s) {
            b[lid] += b[lid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but after that it is a complete mystery. Without understanding it, I can't really implement another version, for instance one taking the maximum of an array, even less so for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations of that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you halve the local size gs and keep halving it (the shift s >>= 1 is equivalent to s = s/2) while s > 0, in other words until s = 1. At each step, the first s work-items add the element s positions away into their own slot, so the number of partial sums halves every iteration. This algorithm requires the array size to be a power of 2; otherwise you have to handle the leftover part beyond the largest power of 2 until the whole array is reduced, or pad the array with values that are neutral for the reduction up to a power-of-2 size.
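To make the loop concrete, here is the same tree reduction written as a CUDA kernel, changed from sum to max (an illustrative sketch, not from the original post; the only change needed for max in the OpenCL kernel above is likewise the single line that combines two elements). It assumes blockDim.x is a power of two and the input length is a multiple of it:

// Each block reduces blockDim.x elements of a[] into one element of r[].
// Launch with dynamic shared memory:
//   reduce_max<<<blocks, threads, threads * sizeof(float)>>>(a, r);
__global__ void reduce_max(const float *a, float *r)
{
    extern __shared__ float b[];                // one float per thread
    unsigned lid = threadIdx.x;
    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;

    b[lid] = a[gid];                            // stage one element per thread
    __syncthreads();

    // Halve the active range each step: s = 64, 32, ..., 1 for 128 threads.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (lid < s)
            b[lid] = fmaxf(b[lid], b[lid + s]); // was: b[lid] += b[lid + s];
        __syncthreads();
    }
    if (lid == 0)
        r[blockIdx.x] = b[0];                   // one partial result per block
}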
About your second concern: when N is bigger than the capacity of your GPU, you are right, you have to run your reduction in portions that fit and then merge the results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with programming the reduction by axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

GPU sorting vs CPU sorting

I made a very naive implementation of the mergesort algorithm, which I adapted to run on CUDA with very minimal implementation changes; the algorithm code follows:
// Merge for mergesort
__device__ void merge(int *aux, int *data, int l, int m, int r)
{
    int i, j, k;
    for (i = m + 1; i > l; i--) {
        aux[i - 1] = data[i - 1];
    }
    // Copy in reverse order the second subarray
    for (j = m; j < r; j++) {
        aux[r + m - j] = data[j + 1];
    }
    // Merge
    for (k = l; k <= r; k++) {
        if (aux[j] < aux[i] || i == (m + 1))
            data[k] = aux[j--];
        else
            data[k] = aux[i++];
    }
}
// What this code does is perform a local merge of the array
__global__
void basic_merge(int *aux, int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int tn = n / (blockDim.x * gridDim.x);
    int l = i * tn;
    int r = l + tn;
    //printf("Thread %d: %d,%d: \n", i, l, r);
    for (int i{1}; i <= (tn / 2) + 1; i *= 2)
        for (int j{l + i}; j < (r + 1); j += 2 * i)
        {
            merge(aux, data, j - i, j - 1, j + i - 1);
        }
    __syncthreads();
    if (i == 0)
    {
        // Complete the merge
        do {
            for (int i{tn}; i < (n + 1); i += 2 * tn)
                merge(aux, data, i - tn, i - 1, i + tn - 1);
            tn *= 2;
        } while (tn < (n / 2) + 1);
    }
}
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much, much worse than the same code on the CPU running on 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on the CPU takes some hundreds of milliseconds; on the GPU it takes up to 10 minutes (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please provide me with some comment? I strongly suspect the problem is in the final merge operation performed by the first thread: at that point we have a certain number of subarrays (the exact amount depends on the number of threads) which are sorted and need to be merged, and this is completed by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that each thread performs further merges in parallel, and the "Complete the merge" step just merges the last two sorted subarrays.
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link; I must admit I still need some time to learn CUDA better before taking full advantage of that material. Anyway, I was able to rewrite the sorting function to take advantage of multiple threads as much as possible; my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one thread.
Now, after the first merge, I use each time up to (1/2)*(n/b) threads, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads it takes about ~10 seconds to sort 30 million elements. Well, this is still a poor result unfortunately! The problem is in the thread synchronization, but first things first, let's see the code:
__global__
void basic_merge(int *aux, int *data, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int b = log2( ceil( (double)n / (blockDim.x * gridDim.x) ) ) + 1;
    b = pow( (float)2, b );
    int l = k * b;
    int r = min(l + b - 1, n - 1);
    __syncthreads();
    for (int m{1}; m <= (r - l); m = 2 * m)
    {
        for (int i{l}; i <= r; i += 2 * m)
        {
            merge(aux, data, i, min(r, i + m - 1), min(r, i + 2 * m - 1));
        }
    }
    __syncthreads();
    do {
        if (k <= (n / b) * .5)
        {
            l = 2 * k * b;
            r = min(l + 2 * b - 1, n - 1);
            merge(aux, data, l, min(r, l + b - 1), r);
        }
        else
            break;
        __syncthreads();
        b *= 2;
    } while ((r + 1) < n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads, instead of the 65000 and more I can run on my CUDA device, because __syncthreads does not work as a synchronization primitive at grid level, but only at block level!
So I can synchronize at most 1024 threads, which is the number of threads supported per block. Without proper synchronization each thread messes up the data of the others, and the merging procedure does not work.
In order to boost the performance I need some kind of synchronization between all the threads in the grid; it seems that no API exists for this purpose, and I read about a solution which involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
I have a plan for how to implement this technique in my mergesort function; I will provide you with the code in the near future. Do you have any suggestions of your own?
Thanks
It looks like all the work is being done in __global__ memory. Each read and each write there takes a long time, which makes the function slow. I think it would help to first copy your data into __shared__ memory, do the work there, and then, when the sorting is completed (for that block), copy the results back to global memory.
Global memory takes about 400 clock cycles to access (or about 100 if the data happens to be in the L2 cache). Shared memory, on the other hand, only takes 1-3 clock cycles to write and read.
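As a rough illustration of the staging pattern (a sketch under the assumption that each block owns a fixed-size tile, not the poster's code), the load and store around the per-block work could look like this:

#define TILE 1024

__global__ void block_sort_shared(int *data, int n)
{
    __shared__ int tile[TILE];
    int base = blockIdx.x * TILE;

    // Coalesced load of this block's tile from global into shared memory.
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        tile[i] = data[base + i];
    __syncthreads();

    // ... the merge passes operate on tile[] here, with __syncthreads()
    //     between passes, exactly as in the global-memory version ...

    // Write the sorted tile back once, again with coalesced accesses.
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        data[base + i] = tile[i];
}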
The above would help performance a lot. Some other, more minor, things you can try:
(1) Remove the first __syncthreads(); it is not really doing anything, because no data is being passed between warps at that point.
(2) Move "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass b in instead. This is being calculated over and over when it really only needs to be calculated once.
I tried to follow along with your algorithm but was not able to. The variable names were hard to follow... or... your code is above my head and I cannot follow. =) Hope the above helps.

set RNG state with openMP and Rcpp

I have a clarification question.
It is my understanding that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a statement that sets the RNG state.
Now how does this all work with OpenMP, either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>

// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma) {
    Rcpp::NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return (out);
}

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores = 1) {
    omp_set_num_threads(cores);
    Rcpp::NumericVector out(n);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return (out);
}
And then run
set.seed(123)
a1 = rnormrcpp1(100, 2, 3)
set.seed(123)
a2 = rnormrcpp1(100, 2, 3)
set.seed(123)
a3 = rnormrcpp2(100, 2, 3, 2)
set.seed(123)
a4 = rnormrcpp2(100, 2, 3, 2)
all.equal(a1, a2)
all.equal(a3, a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the openMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
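As a minimal sketch of the prewinding idea (illustrative only: it uses a C++ standard-library generator rather than R's RNG, since R's RNG API is not safe to call from multiple threads, and it draws plain uniforms because std::normal_distribution may consume a variable number of engine values per draw, which would break exact prewinding):

#include <omp.h>

#include <algorithm>
#include <random>
#include <vector>

std::vector<double> runif_prewound(int n, unsigned seed = 123)
{
    std::vector<double> out(n);
    #pragma omp parallel
    {
        std::mt19937 gen(seed);                 // same seed in every thread
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        int chunk    = (n + nthreads - 1) / nthreads;
        int begin    = tid * chunk;
        int end      = std::min(n, begin + chunk);

        gen.discard(begin);                     // prewind past the draws made
                                                // by lower-numbered threads
        for (int i = begin; i < end; ++i)
            out[i] = gen() / 4294967296.0;      // uniform in [0, 1)
    }
    return out;
}

The result matches the single-threaded run for any thread count, at the cost of the discarded draws.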
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i = 0; i < n; i++) {
    double rnum;
    #pragma omp ordered
    {
        rnum = R::rnorm(mu, sigma);
    }
    out(i) = /* lots of processing on */ rnum;
}
Although loop ordering essentially serialises the computation, it still allows for lots of processing on rnum to execute in parallel and hence parallel speed-up would be observed. See this answer for a better explanation as to why so.
Yes, sourceCpp() etc. instantiate an RNGScope so the RNGs are left in a proper state.
And yes, one can use OpenMP. But inside the OpenMP segment you cannot control the order in which the threads execute, so you no longer get the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.
