Blas daxpy routine with matrices

I am working on some matrices related problems in c++. I want to solve the problem: Y = aX + Y, where X and Y are matrices and a is a constant. I thought about using the daxpy BLAS routine, however, DAXPY according to the documentation is a vectors routine and I am not getting the same results as when I solve the same problem in matlab.
I am currently running this:
F77NAME(daxpy)(N, a, X, 1, Y, 1);

When you need to perform the operation Y = a*X + Y, it does not matter whether X and Y are 1D or 2D matrices, since the operation is done element-wise.
So, if you allocated each matrix through a single pointer, e.g. double *A = new double[M*N];, then you can use daxpy by defining the dimension of the vector as M*N:
int MN = M*N;
int one = 1;
F77NAME(daxpy)(&MN, &a, X, &one, Y, &one);
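The same goes for a two-dimensional matrix on the stack, such as double A[3][2];, because its memory is allocated contiguously (pass &A[0][0] as the vector).
Otherwise, for example when each row is allocated separately, you need a for loop that calls daxpy on each row.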
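The two storage cases can be sketched in plain C. The function axpy_ref below is a hypothetical stand-in that mirrors what daxpy computes with unit strides (it is not the BLAS routine itself), so the sketch builds without linking BLAS; axpy_contiguous and axpy_rows are illustrative names as well.

```c
#include <stddef.h>

/* Hypothetical stand-in for daxpy with unit strides: y[i] += a * x[i]. */
void axpy_ref(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Contiguous storage: an M-by-N matrix is just one vector of M*N elements. */
void axpy_contiguous(size_t m, size_t n, double a,
                     const double *x, double *y)
{
    axpy_ref(m * n, a, x, y);
}

/* Row-pointer storage (each row allocated separately): one call per row. */
void axpy_rows(size_t m, size_t n, double a,
               double *const *x, double *const *y)
{
    for (size_t r = 0; r < m; ++r)
        axpy_ref(n, a, x[r], y[r]);
}
```

For a stack matrix such as double X[3][2], the contiguous version would be called with &X[0][0] and &Y[0][0]. With a real BLAS, each axpy_ref call would presumably become one daxpy call with the same length and unit increments.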


PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis for a multi-dimensional array. I have stumbled upon that code which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version for taking the maximum of an array for instance, even less for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
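It means that the stride s starts at half the local size gs and is halved after every pass (the shift s >>= 1 is equivalent to s = s/2). The loop runs while s > 0, so the last pass has s = 1. In each pass, the first s work-items add element lid+s to element lid, which halves the number of partial sums, and the barrier makes sure every work-item has finished the pass before the next one starts. After the last pass, b[0] holds the sum of the whole work-group. This algorithm depends on your array's size being a power of 2; otherwise you'd have to deal with the elements in excess of a power of 2 separately until you have reduced the whole array, or you'd have to pad your array with neutral values for the reduction (0 for a sum) up to a power-of-2 size.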
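To make the halving loop concrete, here is a sequential C simulation of what one work-group does. It is only a sketch: b plays the role of the __local buffer, the inner loop over lid stands in for the work-items that run concurrently on the device, and the names group_reduce, op_sum and op_max are made up for illustration.

```c
#include <stddef.h>

/* Two reduction operators with the same shape. */
float op_sum(float x, float y) { return x + y; }
float op_max(float x, float y) { return x > y ? x : y; }

/*
 * Simulates the kernel's for loop for a single work-group.
 * gs must be a power of two. Finishing the inner loop before s is
 * halved corresponds to the barrier in the kernel.
 */
float group_reduce(float *b, size_t gs, float (*op)(float, float))
{
    for (size_t s = gs / 2; s > 0; s >>= 1)
        for (size_t lid = 0; lid < s; ++lid)
            b[lid] = op(b[lid], b[lid + s]);  /* fold upper half into lower half */
    return b[0];
}
```

With gs = 8 the passes use s = 4, 2, 1, leaving 4, 2 and finally 1 partial result. In the kernel itself, a max reduction would amount to replacing b[lid] += b[lid+s] with b[lid] = fmax(b[lid], b[lid+s]).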
As for your second concern, about N being bigger than the capacity of your GPU, you are right: you have to run your reduction in portions that fit and then merge the results.
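The "reduce in portions, then merge" idea, together with padding by a neutral value, could look roughly like the following C simulation for a max reduction. GS is a hypothetical work-group size and chunk_max and array_max are illustrative names; on a device each pass would be one kernel launch writing its partial results to a separate buffer such as r, whereas this sketch reuses the input array.

```c
#include <stddef.h>
#include <math.h>   /* INFINITY */

#define GS 4  /* hypothetical work-group size, must be a power of two */

/* Max of one chunk of up to GS elements, padded with -INFINITY. */
float chunk_max(const float *a, size_t count)
{
    float b[GS];
    for (size_t i = 0; i < GS; ++i)
        b[i] = i < count ? a[i] : -INFINITY;  /* neutral value for max */
    for (size_t s = GS / 2; s > 0; s >>= 1)
        for (size_t lid = 0; lid < s; ++lid)
            if (b[lid + s] > b[lid])
                b[lid] = b[lid + s];
    return b[0];
}

/* Repeated passes: each pass turns n values into ceil(n / GS) partial
 * results, until a single value is left. Requires n >= 1; overwrites a. */
float array_max(float *a, size_t n)
{
    while (n > 1) {
        size_t groups = (n + GS - 1) / GS;
        for (size_t g = 0; g < groups; ++g) {
            size_t start = g * GS;
            size_t count = n - start < GS ? n - start : GS;
            a[g] = chunk_max(a + start, count);
        }
        n = groups;
    }
    return a[0];
}
```

For a sum, the padding value would be 0 instead of -INFINITY.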
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with programming the reduction by axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

What is the best way to parallelize Hough Transform algorithm?

In Hough Line Transform, for each edge pixel, we find the corresponding Rho and Theta in the Hough parameter space. The accumulator for Rho and Theta should be global. If we want to parallelize the algorithm, what is the best way to split the accumulator space?
What is the best way to parallelise an algorithm may depend on several aspects. An important such aspect is the hardware that you are targeting. As you have tagged your question with "openmp", I assume that, in your case, the target is an SMP system.
To answer your question, let us start by looking at a typical, straightforward implementation of the Hough transform (I will use C, but what follows applies to C++ and Fortran as well):
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;
    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    ++accum[r * res + t];
                }
    return accum;
}
Given an array of black-and-white pixels (stored row-wise), a width, a height and a target resolution for the angle-component of the Hough space, the function hough returns an accumulator array for the Hough space (organised "angle-wise") and stores an upper bound for its distance dimension in the output argument rlimit. That is, the number of elements in the returned accumulator array is given by res * (*rlimit).
The body of the function centres on three nested loops: the two outermost loops iterate over the input pixels, while the conditionally executed innermost loop iterates over the angular dimension of the Hough space.
To parallelise the algorithm, we have to somehow decompose it into pieces that can execute concurrently. Typically such a decomposition is induced either by the structure of the computation or otherwise by the structure of the data that are operated on.
As, besides iteration, the only computationally interesting task that is carried out by the function is the trigonometry in the body of the innermost loop, there are no obvious opportunities for a decomposition based on the structure of the computation. Therefore, let us focus on decompositions based on the structure of the data, and let us distinguish between
data decompositions that are based on the structure of the input data, and
data decompositions that are based on the structure of the output data.
The structure of the input data, in our case, is given by the pixel array that is passed as an argument to the function hough and that is iterated over by the two outermost loops in the body of the function.
The structure of the output data is given by the structure of the returned accumulator array and is iterated over by the innermost loop in the body of the function.
We first look at output-data decomposition as, for the Hough transform, it leads to the simplest parallel algorithm.
Output-data decomposition
Decomposing the output data into units that can be produced relatively independently materialises into having the iterations of the innermost loop execute in parallel.
Doing so, one has to take into account any so-called loop-carried dependencies for the loop to parallelise. In this case, this is straightforward as there are no such loop-carried dependencies: all iterations of the loop require read-write accesses to the shared array accum, but each iteration operates on its own "private" segment of the array (i.e., those elements that have indices i with i % res == t).
Using OpenMP, this gives us the following straightforward parallel implementation:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;
    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                #pragma omp parallel for
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    ++accum[r * res + t];
                }
    return accum;
}
Input-data decomposition
A data decomposition that follows the structure of the input data can be obtained by parallelising the outermost loop.
That loop, however, does have a loop-carried flow dependency as each loop iteration potentially requires read-write access to each cell of the shared accumulator array. Hence, in order to obtain a correct parallel implementation we have to synchronise these accumulator accesses. In this case, this can easily be done by updating the accumulators atomically.
The loop also carries two so-called antidependencies. These are induced by the induction variables y and t of the inner loops and are trivially dealt with by making them private variables of the parallel outer loop.
The parallel implementation that we end up with then looks like this:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;
    #pragma omp parallel for private(y, t)
    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    #pragma omp atomic
                    ++accum[r * res + t];
                }
    return accum;
}
Evaluation
Evaluating the two data-decomposition strategies, we observe that:
For both strategies, we end up with a parallelisation in which the computationally heavy parts of the algorithm (the trigonometry) are nicely distributed over threads.
Decomposing the output data gives us a parallelisation of the innermost loop in the function hough. As this loop does not have any loop-carried dependencies we do not incur any data-synchronisation overhead. However, as the innermost loop is executed for every set input pixel, we do incur quite a lot of overhead due to repeatedly forming a team of threads etc.
Decomposing the input data gives a parallelisation of the outermost loop. This loop is only executed once and so the threading overhead is minimal. On the other hand, however, we do incur some data-synchronisation overhead for dealing with a loop-carried flow dependency.
Atomic operations in OpenMP can typically be assumed to be quite efficient, while threading overheads are known to be rather large. Consequently, one expects that, for the Hough transform, input-data decomposition gives a more efficient parallel algorithm. This is confirmed by a simple experiment. For this experiment, I applied the two parallel implementations to a randomly generated 1024x768 black-and-white picture with a target resolution of 90 (i.e., 1 accumulator per degree of arc) and compared the results with the sequential implementation. This table shows the relative speedups obtained by the two parallel implementations for different numbers of threads:
# threads | OUTPUT DECOMPOSITION | INPUT DECOMPOSITION
----------+----------------------+--------------------
2 | 1.2 | 1.9
4 | 1.4 | 3.7
8 | 1.5 | 6.8
(The experiment was carried out on an otherwise unloaded dual 2.2 GHz quad-core Intel Xeon E5520. All speedups are averages over five runs. The average running time of the sequential implementation was 2.66 s.)
False sharing
Note that the parallel implementations are susceptible to false sharing of the accumulator array. For the implementation that is based on decomposition of the output data, this false sharing can to a large extent be avoided by transposing the accumulator array (i.e., by organising it "distance-wise"). In my experiments, however, doing so did not result in any observable further speedup.
Conclusion
Returning to your question, "what is the best way to split the accumulator space?", the answer seems to be that it is best not to split the accumulator space at all, but instead split the input space.
If, for some reason, you have set your mind on splitting the accumulator space, you may consider changing the structure of the algorithm so that the outermost loops iterate over the Hough space and the inner loop over whichever is the smallest of the input picture's dimensions. That way, you can still derive a parallel implementation that incurs threading overhead only once and that comes free of data-synchronisation overhead. However, in that scheme, the trigonometry can no longer be conditional and so, in total, each loop iteration will have to do more work than in the scheme above.

Combinations of integers in OpenCL

I have a bunch of vectors (~500). I need to find triple products of all the combinations of the vectors in OpenCL. There are plenty of combination algorithms (r out of n things) in C++ but I am yet to find any implemented for GPU. I have seen quite a few parallel permutation algorithms in Cuda but I just want to know if there are any viable combination algorithms present?
I'll need to guess a bit here and there to answer your question.
I suppose you have an array V of n (~500) vectors. These vectors are all of same dimensionality m (probably m=3).
What you want is the component-wise product of every triple of vectors vi, vj, vk where i,j,k in {0,..,n-1}.
Simple 3-dimensional example:
result[idx].x = V[i].x * V[j].x * V[k].x;
result[idx].y = V[i].y * V[j].y * V[k].y;
result[idx].z = V[i].z * V[j].z * V[k].z;
Now maybe your vectors are not 3-dimensional, and maybe you don't want the component-wise product but the sum of it (like in a dot product), but I'm sure you're able to adjust the code accordingly.
The real question here is how to compute all possible i,j,k and idx. Correct?
Now with CUDA you are in a very fortunate position. You can just launch n*n*n threads in a grid and therefore get i,j,k for free without having to think about ways to compute combinations or permutations at all. Just do the following:
dim3 grid, block;
block.x = n;
block.y = 1;
block.z = 1;
grid.x = n;
grid.y = n;
grid.z = 1;
compute_product_kernel<<<grid, block>>>( V, result );
This way you'll launch n*n blocks of n threads. Computing i,j,k becomes trivial, computing idx is easy:
__global__ void compute_product_kernel( myVector* V, myVector* result)
{
    int i = blockIdx.x;
    int j = blockIdx.y;
    int k = threadIdx.x;
    int idx = i * gridDim.y * blockDim.x + j * blockDim.x + k;
    ...
}
Of course all of this only works because your n is within the limits of CUDA's block and grid range.
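The index arithmetic of that launch configuration can be checked on the host. The following C sketch assumes the geometry above (an n-by-n grid of blocks with n threads each), so gridDim.y and blockDim.x both equal n; the function names are made up for illustration.

```c
#include <stddef.h>

/* Flat result index for i = blockIdx.x, j = blockIdx.y, k = threadIdx.x
 * when gridDim.y == blockDim.x == n. */
size_t triple_to_idx(size_t i, size_t j, size_t k, size_t n)
{
    return i * n * n + j * n + k;
}

/* Inverse mapping, e.g. to find which vectors produced result[idx]. */
void idx_to_triple(size_t idx, size_t n, size_t *i, size_t *j, size_t *k)
{
    *k = idx % n;
    *j = (idx / n) % n;
    *i = idx / (n * n);
}
```

Every (i, j, k) gets a distinct idx in the range 0 to n*n*n - 1, so result needs exactly n*n*n entries.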
Two more things though:
Maybe you want permutations instead of combinations. You could do that by skipping every combination where any two of i,j,k are the same. But I'd recommend keeping them anyway because computing when to skip is probably more expensive than doing the actual work. Also I'd advise against using the permutation to save memory for result because it would save you less than 1% and make the calculation much more complex.
Are you sure you've got enough memory to actually do this? Storing the result requires n*n*n*m*sizeof(float) bytes. With n=500 and m=3 that would already be 1.5 GB. Is that really what you are looking for? Maybe the next step of your processing can be combined into the calculation so that storing the intermediate result is not necessary.
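Both figures follow from simple counting, which the small C sketch below reproduces. The function names are illustrative, and a 4-byte float is assumed.

```c
#include <stddef.h>

/* Bytes needed to store all n*n*n results of m floats each. */
size_t result_bytes(size_t n, size_t m)
{
    return n * n * n * m * sizeof(float);
}

/* Number of index triples in which at least two of i, j, k coincide:
 * all n^3 triples minus the n*(n-1)*(n-2) triples with distinct indices. */
size_t repeated_triples(size_t n)
{
    return n * n * n - n * (n - 1) * (n - 2);
}
```

For n = 500 and m = 3 this gives 1,500,000,000 bytes, and 749,000 of the 125,000,000 triples (about 0.6%) contain a repeated index, which is where the "less than 1%" estimate comes from.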

Multiplying hundreds of matrices using cuda

I am writing a program which requires multiplying hundreds of matrices in parallel using CUDA. Can somebody explain how to perform this operation?
I have seen that the Kepler architecture is capable of dynamic parallelism. Has somebody used this architecture and, if yes, with which Nvidia graphics card?
The easiest way to get fast performing matrix multiply in parallel using CUDA is through the ArrayFire CUDA library using the GFOR loop. Here's some code that does what you want:
int n = 8, m = 8; // dimensions
int t = 10; // number of different matrices
array A = randu(m,n,t); // many matrices
array B = randu(m,n); // one matrix
array C = zeros(m,n,t); // destination
// multiply C=A*B for all A, at the same time
gfor (array i, A.dims(2)) {
    C(span,span,i) = matmul(A(span,span,i), B);
}
print( A );
print( B );
print( C );
ArrayFire automatically tiles out the computation efficiently for execution on the GPU. All that is optimized behind the scenes for you. I find it to be faster than trying to write it by hand myself.
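To make explicit what the gfor loop computes, here is a plain C reference version of the same batched product. It is a sequential sketch for checking results, not how ArrayFire implements it: column-major storage is assumed, and batched_matmul and its parameter names are made up for illustration.

```c
#include <stddef.h>

/* Computes C_i = A_i * B for each of the t matrices A_i.
 * Column-major storage is assumed: element (r, c) of a matrix with
 * m rows sits at index r + c * m. */
void batched_matmul(size_t m, size_t k, size_t n, size_t t,
                    const float *A,  /* t matrices of size m-by-k */
                    const float *B,  /* one matrix of size k-by-n */
                    float *C)        /* t matrices of size m-by-n */
{
    for (size_t i = 0; i < t; ++i) {  /* iterations are independent */
        const float *Ai = A + i * m * k;
        float *Ci = C + i * m * n;
        for (size_t c = 0; c < n; ++c)
            for (size_t r = 0; r < m; ++r) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; ++p)
                    acc += Ai[r + p * m] * B[p + c * k];
                Ci[r + c * m] = acc;
            }
    }
}
```

Because no iteration of the outer loop depends on another, all t products can run at the same time, which is the property the gfor construct relies on.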

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is a factor of four, not twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    % This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    % This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
You could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple of years old) and only saw a speed difference of a factor of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: directly swapping two elements performs increasingly poorly compared with using a temporary variable as the array grows.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M;
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
