Multiplying hundreds of matrices using CUDA

I am writing a program which requires multiplying hundreds of matrices in parallel using CUDA. Can somebody explain how to perform this operation?
I have seen that the Kepler architecture is capable of dynamic parallelism. Has somebody used this architecture, and if yes, which Nvidia graphics card?

The easiest way to get fast matrix multiplication in parallel using CUDA is through the ArrayFire CUDA library, using the GFOR loop. Here's some code that does what you want:
int n = 8, m = 8; // dimensions
int t = 10; // number of different matrices
array A = randu(m,n,t); // many matrices
array B = randu(m,n); // one matrix
array C = zeros(m,n,t); // destination
// multiply C=A*B for all A, at the same time
gfor (array i, A.dims(2)) {
    C(span,span,i) = matmul(A(span,span,i), B);
}
print( A );
print( B );
print( C );
ArrayFire automatically tiles out the computation efficiently for execution on the GPU. All that is optimized behind the scenes for you. I find it to be faster than trying to write it by hand myself.
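If you would rather stay with plain CUDA and cuBLAS, the library also ships batched GEMM entry points for exactly this "many small matrices" case. Below is a minimal sketch, not the answer's method, assuming all matrices are single precision, column-major, already on the device, and that the per-matrix device pointers have been gathered into device-side pointer arrays (check the exact cublasSgemmBatched signature against your CUDA version):
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Sketch: compute C_i = A_i * B_i for i = 0..t-1 in a single batched call.
// d_Aptrs, d_Bptrs, d_Cptrs are device arrays holding t device pointers each;
// A_i is m x k, B_i is k x n, C_i is m x n, all column-major.
void batched_multiply(cublasHandle_t handle,
                      const float * const *d_Aptrs,
                      const float * const *d_Bptrs,
                      float * const *d_Cptrs,
                      int m, int n, int k, int t)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, n, k, &alpha,
                       d_Aptrs, m,   // lda
                       d_Bptrs, k,   // ldb
                       &beta,
                       d_Cptrs, m,   // ldc
                       t);
}
Newer CUDA toolkits also provide cublasSgemmStridedBatched, which avoids building the pointer arrays when the matrices are stored contiguously with a fixed stride.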

Related

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version for taking the maximum of an array for instance, even less for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you take the work-group size gs, divide it by 2, and keep halving it (the shift s >>= 1 is equivalent to s = s/2) while s > 0, so the last iteration runs with s = 1. On each pass, the first s work items each add the element s positions above them into their own slot, so the live portion of b halves every iteration; for example, with gs = 8 the loop runs with s = 4, 2, 1, and after the last pass the group's sum sits in b[0]. This algorithm relies on the work-group size being a power of 2; otherwise you would have to handle the leftover elements separately, or pad the array with the reduction's neutral value (0 for a sum) up to the next power of 2.
Regarding your second concern, you are right: when N is bigger than what your GPU can process in one pass, you have to run the reduction in portions that fit and then merge the partial results (for example, by running the same kernel again on the per-work-group outputs in r).
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you need to reduce along a particular axis.
If you think that the GPU would give you an advantage, first try pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math-intensive, try pyopencl's Parallel Algorithms, e.g. reduction.
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.
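That said, if you do want to adapt the hand-written kernel above, taking the maximum only changes the combining step; a minimal sketch (fmax is a built-in OpenCL C function, and the kernel is otherwise identical to the one you posted):
__kernel void reduce_max(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];                           // stage one element per work item in local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] = fmax(b[lid], b[lid+s]);   // keep the larger of the two halves
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) r[wid] = b[0];                // one partial maximum per work group
}
r then holds one maximum per work group, and a second (much smaller) pass over r gives the global maximum.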

Blas daxpy routine with matrices

I am working on some matrix-related problems in C++. I want to compute Y = aX + Y, where X and Y are matrices and a is a constant. I thought about using the BLAS daxpy routine; however, according to the documentation DAXPY is a vector routine, and I am not getting the same results as when I solve the same problem in MATLAB.
I am currently running this:
F77NAME(daxpy)(N, a, X, 1, Y, 1);
When you need to perform the operation Y = a*X + Y it does not matter whether X and Y are 1D vectors or 2D matrices, since the operation is done element-wise.
So, if you allocated each matrix as a single contiguous block, e.g. double *A = new double[M*N];, then you can use daxpy by setting the vector length to M*N:
int MN = M*N;
int one = 1;
F77NAME(daxpy)(&MN, &a, X, &one, Y, &one);
The same goes for a stack-allocated two-dimensional matrix such as double A[3][2];, since that memory is also laid out contiguously (pass &A[0][0]).
Otherwise, you need to use a for loop and process each row (or column) separately.
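For completeness, here is a small self-contained sketch of the same idea using the CBLAS interface (cblas_daxpy), which takes the scalar arguments by value; with the Fortran symbol you pass the scalars by address but the arrays directly, as above. The numbers are made up for illustration:
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    const int M = 3, N = 2;
    double a = 2.0;
    // Each matrix stored as one contiguous block of M*N doubles
    double X[] = {1, 2, 3, 4, 5, 6};
    double Y[] = {10, 20, 30, 40, 50, 60};

    // Treat both matrices as vectors of length M*N: Y = a*X + Y element-wise
    cblas_daxpy(M * N, a, X, 1, Y, 1);

    for (int i = 0; i < M * N; ++i)
        printf("%g ", Y[i]);   // prints: 12 24 36 48 60 72
    printf("\n");
    return 0;
}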

How to multiply 100 Integers with another 100 integers in parallel on a GPU using PyOpenCL?

There are many PyOpenCL examples of doing arithmetic operations on vectors of size 4. If I have to multiply 100 integers with another 100 integers all at once using an AMD GPU on a Mac through PyOpenCL, can someone provide and explain the code, please? Since the maximum vector size is 16, I would like to know how I can ask the GPU to do this operation, which needs to process more than 16 integers in parallel.
I have an AMD FirePro D500 GPU.
Does every work item (thread) perform a task independently? If yes, there are 24 compute units, and each compute unit has 255 work items for a single dimension and [255,255,255] for three dimensions. Does that mean my GPU has 6120 independent work items?
I made a short example for an entry-wise multiplication of two one-dimensional integer arrays. Note that if you plan to multiply only 100 values it will not be faster than doing this on the CPU, since you have a lot of overhead with copying the data and so on.
import pyopencl as cl
import numpy as np
#this is compiled by the GPU driver and will be executed on the GPU
kernelsource = """
__kernel void multInt(__global int* res,
                      __global int* a,
                      __global int* b) {
    int i = get_global_id(0);
    int N = get_global_size(0); //this is the dimension given as second argument in the kernel execution
    res[i] = a[i] * b[i];
}
"""
device = cl.get_platforms()[0].get_devices()[0]
context = cl.Context([device])
program = cl.Program(context, kernelsource).build()
queue = cl.CommandQueue(context)
#preparing input data in numpy arrays in host memory (i.e. accessible by the CPU)
N = 100
a_local = np.array(range(N)).astype(np.int32)
b_local = (np.ones(N)*10).astype(np.int32)
#preparing result buffer in host memory
res_local = np.zeros(N).astype(np.int32)
#copy input data to GPU-memory
a_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a_local)
b_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b_local)
#prepare result buffer in GPU-memory
res_buf = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, res_local.nbytes)
#execute previously compiled kernel on GPU
program.multInt(queue,(N,), None, res_buf, a_buf, b_buf)
#copy the result from GPU-memory to CPU-memory
cl.enqueue_copy(queue, res_local, res_buf)
print("result: {}".format(res_local))
Regarding the documentation of PyOpenCL: once you have understood the working principles of GPGPU programming and the programming concepts of OpenCL, PyOpenCL is pretty straightforward.

Combinations of integers in OpenCL

I have a bunch of vectors (~500). I need to find the triple products of all the combinations of the vectors in OpenCL. There are plenty of combination algorithms (r out of n things) in C++, but I have yet to find any implemented for a GPU. I have seen quite a few parallel permutation algorithms in CUDA, but I just want to know if there are any viable combination algorithms available.
I'll need to guess a bit here and there to answer your question.
I suppose you have an array V of n (~500) vectors. These vectors are all of the same dimensionality m (probably m = 3).
What you want is the component-wise product of every triple of vectors V[i], V[j], V[k] where i, j, k are in {0, ..., n-1}.
Simple 3-dimensional example:
result[idx].x = V[i].x * V[j].x * V[k].x;
result[idx].y = V[i].y * V[j].y * V[k].y;
result[idx].z = V[i].z * V[j].z * V[k].z;
Now maybe your vectors are not 3-dimensional, and maybe you don't want the component-wise product but the sum of its components (as in a dot product), but I'm sure you're able to adjust the code accordingly.
The real question here is how to compute all possible i,j,k and idx. Correct?
Now with CUDA you are in a very fortunate position. You can just launch n*n*n threads in a grid and therefore get i,j,k for free without having to think about ways to compute combinations or permutations at all. Just do the following:
dim3 grid, block;
block.x = n;
block.y = 1;
block.z = 1;
grid.x = n;
grid.y = n;
grid.z = 1;
compute_product_kernel<<<grid, block>>>( V, result );
This way you'll launch n*n blocks of n threads. Computing i,j,k becomes trivial, computing idx is easy:
__global__ void compute_product_kernel( myVector* V, myVector* result)
{
    int i = blockIdx.x;
    int j = blockIdx.y;
    int k = threadIdx.x;
    int idx = i * gridDim.y * blockDim.x + j * blockDim.x + k;
    ...
}
Of course all of this only works because your n is within the limits of CUDA's block and grid range.
Two more things though:
Maybe you want only triples of distinct vectors rather than allowing repeated indices. You could do that by skipping every triple where any two of i, j, k are the same. But I'd recommend keeping them anyway, because computing when to skip is probably more expensive than doing the actual work. I'd also advise against using that to save memory for result, because it would save you less than 1% and make the index calculation much more complex.
Are you sure you've got enough memory to actually do this? Storing the result requires n*n*n*m*sizeof(float) bytes. With n = 500 and m = 3 that is already 1.5 GB. Is that really what you are looking for? Maybe the next step of your processing can be folded into this calculation so that storing the intermediate result is not necessary.
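Since the question is actually about OpenCL: the same trick carries over directly, because clEnqueueNDRangeKernel accepts a three-dimensional global work size, so you can enqueue the kernel with a global size of {n, n, n} and read i, j, k straight from the work-item IDs. A minimal sketch, assuming 3-component float vectors and the same result layout as above (the kernel and argument names are illustrative):
__kernel void compute_product(__global const float3* V,
                              __global float3* result,
                              const int n)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int idx = (i * n + j) * n + k;        // same linearisation as the CUDA version
    result[idx] = V[i] * V[j] * V[k];     // component-wise triple product
}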

coding with vectors using the Accelerate framework

I'm playing around with the Accelerate framework for the first time with the goal of implementing some vectorized code into an iOS application. I've never tried to do anything with respect to working with vectors in Objective C or C. Having some experience with MATLAB, I wonder if using Accelerate is indeed that much more of a pain. Suppose I'd want to calculate the following:
b = 4*(sin(a/2))^2 where a and b are vectors.
MATLAB code:
a = 1:4;
b = 4*(sin(a/2)).^2;
However, as I see it after some sifting through the documentation, things are quite different using Accelerate.
My C implementation:
float a[4] = {1,2,3,4}; //define a
int len = 4;
float div = 2; //define 2
float a2[len]; //define intermediate result 1
vDSP_vsdiv(a, 1, &div, a2, 1, len); //divide
float sinResult[len]; //define intermediate result 2
vvsinf(sinResult, a2, &len); //take sine
float sqResult[len]; //square the result
vDSP_vsq(sinResult, 1, sqResult, 1, len); //take square
float factor = 4; //multiply all this by four
float b[len]; //define answer vector
vDSP_vsmul(sqResult, 1, &factor, b, 1, len); //multiply
//unset all variables I didn't actually need
Honestly, I don't know what's worst here: keeping track of all intermediate steps, trying to memorize how the arguments are passed in vDSP with respect to VecLib (quite different), or that it takes so much time doing something quite trivial.
I really hope I am missing something here and that most steps can be merged or shortened. Any recommendations on coding resources, good coding habits (learned the hard way or from a book), etc. would be very welcome! How do you all deal with multiple lines of vector calculations?
I guess you could write it that way, but it seems awfully complicated to me. I like this better (intel-specific, but can easily be abstracted for other architectures):
#include <Accelerate/Accelerate.h>
#include <immintrin.h>
const __m128 a = {1,2,3,4};
const __m128 sina2 = vsinf(a*_mm_set1_ps(0.5));
const __m128 b = _mm_set1_ps(4)*sina2*sina2;
Also, just to be pedantic, what you're doing here is not linear algebra. Linear algebra involves only linear operations (no squaring, no transcendental operations like sin).
Edit: as you noted, the above won't quite work out of the box on iOS; the biggest issue is that there is no vsinf (vMathLib is not available in Accelerate on iOS). I don't have the SDK installed on my machine to test, but I believe that something like the following should work:
#include <Accelerate/Accelerate.h>
const vFloat a = {1, 2, 3, 4};
const vFloat a2 = a*(vFloat){0.5,0.5,0.5,0.5};
const int n = 4;
vFloat sina2;
vvsinf((float *)&sina2, (const float *)&a2, &n);
const vFloat b = sina2*sina2*(vFloat){4,4,4,4};
Not quite as pretty as what is possible with vMathLib, but still fairly compact.
In general, a lot of basic arithmetic operations on vectors just work; there's no need to use calls to any library, which is why Accelerate doesn't go out of its way to supply those operations cleanly. Instead, Accelerate usually tries to provide operations that aren't immediately available by other means.
To answer my own question:
In iOS 6, vMathLib will be introduced. As Stephen clarified, vMathLib could already be used on OSX, but it was not available in iOS. Until now.
The functions that vMathLib provides will allow for easier vector calculations.
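For illustration, with vMathLib available the accepted answer's first snippet carries over essentially unchanged; a minimal sketch, assuming the vFloat variant of vsinf that vMathLib declares (the same call that was unavailable on older iOS versions):
#include <Accelerate/Accelerate.h>

const vFloat a = {1, 2, 3, 4};
const vFloat sina2 = vsinf(a * (vFloat){0.5, 0.5, 0.5, 0.5});  // sin(a/2), four lanes at once
const vFloat b = (vFloat){4, 4, 4, 4} * sina2 * sina2;          // 4*sin(a/2)^2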
