In Hough Line Transform, for each edge pixel, we find the corresponding Rho and Theta in the Hough parameter space. The accumulator for Rho and Theta should be global. If we want to parallelize the algorithm, what is the best way to split the accumulator space?
What is the best way to parallelise an algorithm may depend on several aspects. An important such aspect is the hardware that you are targeting. As you have tagged your question with "openmp", I assume that, in your case, the target is an SMP system.
To answer your question, let us start by looking at a typical, straightforward implementation of the Hough transform (I will use C, but what follows applies to C++ and Fortran as well):
#include <math.h>     /* sqrt, cos, sin, M_PI_2 */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>   /* calloc */

size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;

    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    ++accum[r * res + t];
                }
    return accum;
}
Given an array of black-and-white pixels (stored row-wise), a width, a height and a target resolution for the angle-component of the Hough space, the function hough returns an accumulator array for the Hough space (organised "angle-wise") and stores an upper bound for its distance dimension in the output argument rlimit. That is, the number of elements in the returned accumulator array is given by res * (*rlimit).
The body of the function centres on three nested loops: the two outermost loops iterate over the input pixels, while the conditionally executed innermost loop iterates over the angular dimension of the Hough space.
To parallelise the algorithm, we have to somehow decompose it into pieces that can execute concurrently. Typically such a decomposition is induced either by the structure of the computation or otherwise by the structure of the data that are operated on.
As, besides iteration, the only computationally interesting task that is carried out by the function is the trigonometry in the body of the innermost loop, there are no obvious opportunities for a decomposition based on the structure of the computation. Therefore, let us focus on decompositions based on the structure of the data, and let us distinguish between
data decompositions that are based on the structure of the input data, and
data decompositions that are based on the structure of the output data.
The structure of the input data, in our case, is given by the pixel array that is passed as an argument to the function hough and that is iterated over by the two outermost loops in the body of the function.
The structure of the output data is given by the structure of the returned accumulator array and is iterated over by the innermost loop in the body of the function.
We first look at output-data decomposition as, for the Hough transform, it leads to the simplest parallel algorithm.
Output-data decomposition
Decomposing the output data into units that can be produced relatively independently materialises into having the iterations of the innermost loop execute in parallel.
Doing so, one has to take into account any so-called loop-carried dependencies for the loop to parallelise. In this case, this is straightforward as there are no such loop-carried dependencies: all iterations of the loop require read-write accesses to the shared array accum, but each iteration operates on its own "private" segment of the array (i.e., those elements that have indices i with i % res == t).
Using OpenMP, this gives us the following straightforward parallel implementation:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;

    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                #pragma omp parallel for
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    ++accum[r * res + t];
                }
    return accum;
}
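The claim that each angle index t owns a "private" slice of accum can be checked with a small sketch. The Python below is only an illustration under assumed sizes (the function name touched_indices and the 16x12 grid are made up for this example); it mirrors the index arithmetic r * res + t of the C code and confirms that the index sets of different angles never overlap.

```python
import math

def touched_indices(points, res, t):
    """Accumulator indices r * res + t that angle index t writes to."""
    theta = t * (math.pi / 2) / res
    return {int(x * math.cos(theta) + y * math.sin(theta)) * res + t
            for x, y in points}

res = 8
points = [(x, y) for x in range(16) for y in range(12)]
per_angle = [touched_indices(points, res, t) for t in range(res)]

# Every index written for angle t is congruent to t modulo res ...
all_congruent = all(i % res == t
                    for t, idxs in enumerate(per_angle) for i in idxs)
# ... so the index sets of two different angles are disjoint.
disjoint = all(per_angle[a].isdisjoint(per_angle[b])
               for a in range(res) for b in range(a + 1, res))
print(all_congruent, disjoint)
```

Because the sets are disjoint, no two iterations of the innermost loop ever update the same accumulator cell, which is why no synchronisation is needed there.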
Input-data decomposition
A data decomposition that follows the structure of the input data can be obtained by parallelising the outermost loop.
That loop, however, does have a loop-carried flow dependency as each loop iteration potentially requires read-write access to each cell of the shared accumulator array. Hence, in order to obtain a correct parallel implementation we have to synchronise these accumulator accesses. In this case, this can easily be done by updating the accumulators atomically.
The loop also carries two so-called antidependencies. These are induced by the induction variables y and t of the inner loops and are trivially dealt with by making them private variables of the parallel outer loop.
The parallel implementation that we end up with then looks like this:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
    *rlimit = (size_t)(sqrt(w * w + h * h));
    double step = M_PI_2 / res;
    size_t *accum = calloc(res * *rlimit, sizeof(size_t));
    size_t x, y, t;

    #pragma omp parallel for private(y, t)
    for (x = 0; x < w; ++x)
        for (y = 0; y < h; ++y)
            if (pixels[y * w + x])
                for (t = 0; t < res; ++t)
                {
                    double theta = t * step;
                    size_t r = x * cos(theta) + y * sin(theta);
                    #pragma omp atomic
                    ++accum[r * res + t];
                }
    return accum;
}
Evaluation
Evaluating the two data-decomposition strategies, we observe that:
For both strategies, we end up with a parallelisation in which the computationally heavy parts of the algorithm (the trigonometry) are nicely distributed over threads.
Decomposing the output data gives us a parallelisation of the innermost loop in the function hough. As this loop does not have any loop-carried dependencies we do not incur any data-synchronisation overhead. However, as the innermost loop is executed for every set input pixel, we do incur quite a lot of overhead due to repeatedly forming a team of threads etc.
Decomposing the input data gives a parallelisation of the outermost loop. This loop is only executed once and so the threading overhead is minimal. On the other hand, however, we do incur some data-synchronisation overhead for dealing with a loop-carried flow dependency.
Atomic operations in OpenMP can typically be assumed to be quite efficient, while threading overheads are known to be rather large. Consequently, one expects that, for the Hough transform, input-data decomposition gives a more efficient parallel algorithm. This is confirmed by a simple experiment. For this experiment, I applied the two parallel implementations to a randomly generated 1024x768 black-and-white picture with a target resolution of 90 (i.e., 1 accumulator per degree of arc) and compared the results with the sequential implementation. This table shows the relative speedups obtained by the two parallel implementations for different numbers of threads:
# threads | OUTPUT DECOMPOSITION | INPUT DECOMPOSITION
----------+----------------------+--------------------
2 | 1.2 | 1.9
4 | 1.4 | 3.7
8 | 1.5 | 6.8
(The experiment was carried out on an otherwise unloaded dual 2.2 GHz quad-core Intel Xeon E5520. All speedups are averages over five runs. The average running time of the sequential implementation was 2.66 s.)
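A common alternative to atomic updates, which was not part of the experiment above, is to give every worker its own private accumulator and to sum the private copies at the end. The following Python sketch only simulates that idea sequentially; the names hough_columns and merge, the tiny image and the two-way split are assumptions made for illustration.

```python
import math

def hough_columns(pixels, w, h, res, rlimit, x_begin, x_end):
    """Accumulate votes for image columns x_begin..x_end-1 into a private array."""
    step = (math.pi / 2) / res
    accum = [0] * (res * rlimit)
    for x in range(x_begin, x_end):
        for y in range(h):
            if pixels[y * w + x]:
                for t in range(res):
                    theta = t * step
                    r = int(x * math.cos(theta) + y * math.sin(theta))
                    accum[r * res + t] += 1
    return accum

def merge(partials):
    """Element-wise sum of the private accumulators (the reduction step)."""
    return [sum(cells) for cells in zip(*partials)]

w, h, res = 8, 6, 4
rlimit = int(math.sqrt(w * w + h * h))
# Deterministic test image, stored row-wise like the C version expects.
pixels = [(x * 7 + y * 3) % 5 == 0 for y in range(h) for x in range(w)]

sequential = hough_columns(pixels, w, h, res, rlimit, 0, w)
# Two "workers", each owning half of the columns; a real program would run them concurrently.
partials = [hough_columns(pixels, w, h, res, rlimit, 0, w // 2),
            hough_columns(pixels, w, h, res, rlimit, w // 2, w)]
combined = merge(partials)
print(combined == sequential)
```

The trade-off is memory: every worker needs a full copy of the accumulator, and the final merge costs one pass over all copies.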
False sharing
Note that the parallel implementations are susceptible to false sharing of the accumulator array. For the implementation that is based on decomposition of the output data, this false sharing can to a large extent be avoided by transposing the accumulator array (i.e., by organising it "distance-wise"). In my experiments, however, doing so did not result in any observable further speedups.
Conclusion
Returning to your question, "what is the best way to split the accumulator space?", the answer seems to be that it is best not to split the accumulator space at all, but instead split the input space.
If, for some reason, you have set your mind on splitting the accumulator space, you may consider changing the structure of the algorithm so that the outermost loops iterate over the Hough space and the inner loop over whichever is the smallest of the input picture's dimensions. That way, you can still derive a parallel implementation that incurs threading overhead only once and that comes free of data-synchronisation overhead. However, in that scheme, the trigonometry can no longer be conditional and so, in total, each loop iteration will have to do more work than in the scheme above.
Related
I am working on some matrix-related problems in C++. I want to solve the problem Y = aX + Y, where X and Y are matrices and a is a constant. I thought about using the daxpy BLAS routine; however, according to the documentation, DAXPY is a vector routine, and I am not getting the same results as when I solve the same problem in MATLAB.
I am currently running this:
F77NAME(daxpy)(N, a, X, 1, Y, 1);
When you need to perform the operation Y=a*X+Y, it does not matter whether X and Y are 1D or 2D matrices, since the operation is done element-wise.
So, if you allocated the matrices through single pointers, e.g. double *A = new double[M*N];, then you can use daxpy by defining the dimension of the vector as M*N:
int MN = M*N;
int one = 1;
F77NAME(daxpy)(&MN, &a, X, &one, Y, &one);
The same goes for a two-dimensional matrix on the stack, such as double A[3][2];, since its memory is also allocated contiguously.
Otherwise, you need to use a for loop and add each row separately.
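To see why treating the matrices as one long vector gives the same result, here is a small Python sketch. It does not call BLAS; the function axpy is a hypothetical reference implementation of what daxpy computes for positive strides, and the 2x3 matrices are made-up test data.

```python
def axpy(n, a, x, incx, y, incy):
    """Reference semantics of daxpy for positive strides: y[i*incy] += a * x[i*incx]."""
    for i in range(n):
        y[i * incy] += a * x[i * incx]

M, N, a = 2, 3, 2.0
X2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Y2 = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]

# Flatten both matrices into contiguous storage. Any layout works,
# as long as X and Y use the same one.
X = [v for row in X2 for v in row]
Y = [v for row in Y2 for v in row]

axpy(M * N, a, X, 1, Y, 1)

# Element-wise 2D result computed directly from the original matrices.
expected = [a * X2[i][j] + Y2[i][j] for i in range(M) for j in range(N)]
print(Y == expected)
```

If the results still differ from MATLAB, a likely place to look is the storage order or padding of the two arrays, since the element-wise update itself is layout-independent only when both operands share the same layout.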
I have a bunch of vectors (~500). I need to find triple products of all the combinations of the vectors in OpenCL. There are plenty of combination algorithms (r out of n things) in C++ but I am yet to find any implemented for GPU. I have seen quite a few parallel permutation algorithms in Cuda but I just want to know if there are any viable combination algorithms present?
I'll need to guess a bit here and there to answer your question.
I suppose you have an array V of n (~500) vectors. These vectors are all of same dimensionality m (probably m=3).
What you want is the component wise product of each 3 vectors vi, vj, vk where i,j,k in {0,..,n-1}.
Simple 3-dimensional example:
result[idx].x = V[i].x * V[j].x * V[k].x;
result[idx].y = V[i].y * V[j].y * V[k].y;
result[idx].z = V[i].z * V[j].z * V[k].z;
Now maybe your vectors are not 3-dimensional, and maybe you don't want the component-wise product but its sum (like in a dot product), but I'm sure you're able to adjust the code accordingly.
The real question here is how to compute all possible i,j,k and idx. Correct?
Now with CUDA you are in a very fortunate position. You can just launch n*n*n threads in a grid and therefore get i,j,k for free without having to think about ways to compute combinations or permutations at all. Just do the following:
dim3 grid, block;
block.x = n;
block.y = 1;
block.z = 1;
grid.x = n;
grid.y = n;
grid.z = 1;
compute_product_kernel<<<grid, block>>>( V, result );
This way you'll launch n*n blocks of n threads. Computing i,j,k becomes trivial, computing idx is easy:
__global__ void compute_product_kernel( myVector* V, myVector* result)
{
int i = blockIdx.x;
int j = blockIdx.y;
int k = threadIdx.x;
int idx = i * gridDim.y * blockDim.x + j * blockDim.x + k;
...
}
Of course all of this only works because your n is within the limits of CUDA's block and grid range.
Two more things though:
Maybe you want permutations instead of combinations. You could do that by skipping every combination where any two of i,j,k are the same. But I'd recommend keeping them anyway, because computing when to skip is probably more expensive than doing the actual work. Also, I'd advise against using the permutation to save memory for result, because it would save you less than 1% and make the calculation much more complex.
Are you sure you've got enough memory to actually do this? Storing the result requires n*n*n*m*sizeof(float) bytes. With n=500 and m=3 that would already be 1.5 GB. Is that really what you are looking for? Maybe the next step of your processing can be combined into the calculation so that storing the intermediate result is not necessary.
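The index arithmetic and the memory estimate are easy to sanity-check on the host. The Python sketch below is illustrative only; flat_index, unflatten and result_bytes are hypothetical helper names, and the sketch assumes gridDim.y and blockDim.x are both equal to n, as in the launch configuration above.

```python
def flat_index(i, j, k, n):
    """Mirror of idx = i * gridDim.y * blockDim.x + j * blockDim.x + k for an n x n grid of n-thread blocks."""
    return (i * n + j) * n + k

def unflatten(idx, n):
    """Recover (i, j, k) from a flat index."""
    return idx // (n * n), (idx // n) % n, idx % n

def result_bytes(n, m, bytes_per_component=4):
    """Storage needed for all n**3 triples of m-component float vectors."""
    return n ** 3 * m * bytes_per_component

n = 5
indices = [flat_index(i, j, k, n)
           for i in range(n) for j in range(n) for k in range(n)]
print(indices == list(range(n ** 3)))        # every slot is hit exactly once
print(unflatten(flat_index(3, 1, 4, n), n))  # round trip back to (i, j, k)
print(result_bytes(500, 3))                  # bytes for n=500, m=3
```

The last line reproduces the 1.5 GB figure (1,500,000,000 bytes) quoted above.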
I'm writing a program for matrix multiplication with OpenMP that, for better cache efficiency, implements the multiplication A x B(transpose), rows x rows, instead of the classic A x B, rows x columns. Doing this, I came across an interesting fact that seems illogical to me: if I parallelize the outer loop in this code, the program is slower than if I put the OpenMP directive on the innermost loop. On my computer the times are 10.9 vs 8.1 seconds.
//A and B are double* allocated with malloc, Nu is the length of the matrices
//which are square
//#pragma omp parallel for
for (i=0; i<Nu; i++){
for (j=0; j<Nu; j++){
*(C+(i*Nu+j)) = 0.;
#pragma omp parallel for
for(k=0;k<Nu ;k++){
*(C+(i*Nu+j))+=*(A+(i*Nu+k)) * *(B+(j*Nu+k));//C(i,j)=sum(over k) A(i,k)*B(k,j)
}
}
}
Try writing to the result less often. Updating C inside the innermost loop induces cache-line sharing between cores and prevents the operation from running in parallel. Using a local variable instead will allow most of the writes to take place in each core's L1 cache.
Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
Try:
for (i=0; i<Nu; i++){
    const double* const Arow = A + i*Nu;
    double* const Crow = C + i*Nu;
    // k is declared outside the loops, so it must be made private to each thread
    #pragma omp parallel for private(k)
    for (j=0; j<Nu; j++){
        const double* const Bcol = B + j*Nu;
        double sum = 0.0;
        for(k=0;k<Nu ;k++){
            sum += Arow[k] * Bcol[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
        }
        Crow[j] = sum;
    }
}
Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.
You probably have some data dependencies when you parallelize the outer loop that the compiler is not able to resolve, so it adds additional locks.
Most probably it decides that different outer-loop iterations could write into the same (C+(i*Nu+j)), and it adds access locks to protect it.
The compiler could probably figure out that there are no dependencies if you parallelize the second loop. But figuring out that there are no dependencies when parallelizing the outer loop is not so trivial for a compiler.
UPDATE
Some performance measurements.
Hi again. It looks like 1,000 double multiplications and additions are not enough to cover the cost of thread synchronization.
I've done a few small tests, and simple vector-scalar multiplication is not effective with OpenMP unless the number of elements is greater than ~10,000. Basically, the larger your array is, the more performance you will gain from using OpenMP.
So, when parallelizing the innermost loop, you have to split the work between threads and gather the data back 1,000,000 times.
PS. Try Intel ICC; it is more or less free to use for students and open-source projects. I remember using OpenMP with it for arrays of fewer than 10,000 elements.
UPDATE 2: Reduction example
double sum = 0.0;
int k=0;
double *al = A+i*Nu;
double *bl = B+j*Nu;   // row j of B (B is stored transposed)
#pragma omp parallel for shared(al, bl) reduction(+:sum)
for(k=0;k<Nu ;k++){
    sum +=al[k] * bl[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
}
C[i*Nu+j] = sum;
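For readers unfamiliar with what a reduction does conceptually, here is a sequential Python sketch of the pattern: each worker sums its own chunk into a private variable, and the private sums are combined once at the end. The function name dot_reduction and the chunking scheme are assumptions for illustration and say nothing about how a particular OpenMP runtime schedules the work.

```python
def dot_reduction(a, b, num_workers=4):
    """Dot product computed as per-worker partial sums that are combined at the end."""
    n = len(a)
    # Split the index range 0..n-1 into num_workers contiguous chunks.
    bounds = [n * w // num_workers for w in range(num_workers + 1)]
    partials = []
    for w in range(num_workers):
        local_sum = 0.0                      # private to this worker
        for k in range(bounds[w], bounds[w + 1]):
            local_sum += a[k] * b[k]
        partials.append(local_sum)
    return sum(partials)                     # the reduction step

a = [float(k) for k in range(10)]
b = [2.0] * 10
print(dot_reduction(a, b))
```

Note that with general floating-point data the chunked summation order can change the last few bits of the result compared with a strictly sequential sum.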
I have to calculate the following:
float2 y = CONSTANT;
for (int i = 0; i < totalN; i++)
h[i] = cos(y*i);
totalN is a large number, so I would like to make this in a more efficient way. Is there any way to improve this? I suspect there is, because, after all, we know what's the result of cos(n), for n=1..N, so maybe there's some theorem that allows me to compute this in a faster way. I would really appreciate any hint.
Thanks in advance,
Federico
Using one of the most beautiful formulas of mathematics, Euler's formula
exp(i*x) = cos(x) + i*sin(x),
substituting x := n * phi:
cos(n*phi) = Re( exp(i*n*phi) )
sin(n*phi) = Im( exp(i*n*phi) )
exp(i*n*phi) = exp(i*phi) ^ n
Power ^n is n repeated multiplications.
Therefore you can calculate cos(n*phi) and simultaneously sin(n*phi) by repeated complex multiplication by exp(i*phi) starting with (1+i*0).
Code examples:
Python:
from math import cos, sin, pi
DEG2RAD = pi/180.0 # conversion factor degrees --> radians
phi = 10*DEG2RAD # constant e.g. 10 degrees
c = cos(phi)+1j*sin(phi) # = exp(1j*phi)
h=1+0j
for i in range(1,10):
    h = h*c
    print("%d %8.3f" % (i, h.real))
or C:
#include <stdio.h>
#include <math.h>
// number of values to calculate:
#define N 10
// conversion factor degrees --> radians:
#define DEG2RAD (3.14159265/180.0)
// e.g. constant is 10 degrees:
#define PHI (10*DEG2RAD)
typedef struct
{
    double re,im;
} complex_t;
int main(int argc, char **argv)
{
    complex_t c;
    complex_t h[N];
    int index;
    c.re=cos(PHI);
    c.im=sin(PHI);
    h[0].re=1.0;
    h[0].im=0.0;
    for(index=1; index<N; index++)
    {
        // complex multiplication h[index] = h[index-1] * c;
        h[index].re=h[index-1].re*c.re - h[index-1].im*c.im;
        h[index].im=h[index-1].re*c.im + h[index-1].im*c.re;
        printf("%d: %8.3f\n",index,h[index].re);
    }
    return 0;
}
I'm not sure what kind of accuracy vs. performance compromises you're willing to make, but there are extensive discussions of various sinusoid approximation techniques at these links:
Fun with Sinusoids - http://www.audiomulch.com/~rossb/code/sinusoids/
Fast and accurate sine/cosine - http://www.devmaster.net/forums/showthread.php?t=5784
Edit (I think this is the "Don Cross" link that's broken on the "Fun with Sinusoids" page):
Optimizing Trig Calculations - http://groovit.disjunkt.com/analog/time-domain/fasttrig.html
Maybe the simplest formula is
cos(n+y) = 2cos(n)cos(y) - cos(n-y).
If you precompute the constant 2*cos(y) then each value cos(n+y) can be computed from the previous 2 values with one single multiplication and one subtraction.
I.e., in pseudocode
h[0] = 1.0
h[1] = cos(y)
m = 2*h[1]
for (int i = 2; i < totalN; ++i)
h[i] = m*h[i-1] - h[i-2]
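A runnable version of this pseudocode, as a minimal Python sketch: the function name cos_multiples and the test values y = 0.1 and 1000 terms are assumptions chosen for illustration, and the comparison against math.cos is only there to show how the accumulated round-off can be measured.

```python
import math

def cos_multiples(y, total_n):
    """Return [cos(0*y), cos(1*y), ..., cos((total_n-1)*y)] via the three-term recurrence."""
    h = [0.0] * total_n
    if total_n > 0:
        h[0] = 1.0
    if total_n > 1:
        h[1] = math.cos(y)
    m = 2.0 * math.cos(y)        # precomputed constant 2*cos(y)
    for i in range(2, total_n):
        h[i] = m * h[i - 1] - h[i - 2]
    return h

y = 0.1
h = cos_multiples(y, 1000)
# Largest deviation from the library cosine over the whole table.
max_err = max(abs(h[i] - math.cos(i * y)) for i in range(len(h)))
print(max_err < 1e-9)
```

The recurrence becomes more sensitive to round-off as y approaches a multiple of pi, so it is worth measuring the error for the actual constant you use.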
Here's a method, but it uses a little bit of memory for the sin. It uses the trig identities:
cos(a + b) = cos(a)cos(b)-sin(a)sin(b)
sin(a + b) = sin(a)cos(b)+cos(a)sin(b)
Then here's the code:
h[0] = 1.0;
double g1 = sin(y);
double glast = g1;
h[1] = cos(y);
for (int i = 2; i < totalN; i++){
    h[i] = h[i-1]*h[1]-glast*g1;
    glast = glast*h[1]+h[i-1]*g1;
}
If I didn't make any errors then that should do it. Of course there could be round-off problems so be aware of that. I implemented this in Python and it is quite accurate.
There are some good answers here but they are all recursive. Recursive calculation will not work for cosine function when using floating point arithmetic; you will invariably get rounding errors which quickly compound.
Consider calculation y = 45 degrees, totalN 10 000. You won't end up with 1 as the final result.
To address Kirk's concerns: all of the solutions based on the recurrence for cos and sin boil down to computing
x(k) = R x(k - 1),
where R is the matrix that rotates by y and x(0) is the unit vector (1, 0). If the true result for k - 1 is x'(k - 1) and the true result for k is x'(k), then the error goes from e(k - 1) = x(k - 1) - x'(k - 1) to e(k) = R x(k - 1) - R x'(k - 1) = R e(k - 1) by linearity. Since R is what's called an orthogonal matrix, R e(k - 1) has the same norm as e(k - 1), and the error grows very slowly. (The reason it grows at all is due to round-off; the computer representation of R is in general almost, but not quite orthogonal, so it will be necessary to restart the recurrence using the trig operations from time to time depending on the accuracy required. This is still much, much faster than using the trig ops to compute each value.)
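The restart idea can be sketched as follows. This Python example is an illustration under assumptions: the function name cos_table is made up, and the restart interval of 1024 steps is an arbitrary choice, not a recommendation derived from the analysis above. It uses the y = 45 degrees, 10,000 steps case raised earlier as a check.

```python
import math

def cos_table(y, total_n, restart_every=1024):
    """cos(i*y) for i in range(total_n) via rotation, re-seeded from math.cos/sin periodically."""
    c, s = math.cos(y), math.sin(y)
    h = [0.0] * total_n
    ci, si = 1.0, 0.0                      # cos(0), sin(0)
    for i in range(total_n):
        if i % restart_every == 0:
            # Restart: discard accumulated round-off by recomputing directly.
            ci, si = math.cos(i * y), math.sin(i * y)
        h[i] = ci
        # Rotate (ci, si) by the angle y.
        ci, si = ci * c - si * s, si * c + ci * s
    return h

y = math.radians(45.0)
h = cos_table(y, 10001)
max_err = max(abs(h[i] - math.cos(i * y)) for i in range(len(h)))
print(abs(h[10000] - 1.0) < 1e-9, max_err < 1e-9)
```

With this interval only about one value in a thousand is computed with the trig functions; the rest cost four multiplications and two additions each.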
You can do this using complex numbers.
If you define x = cos(y) + i sin(y), cos(y*i) will be the real part of x^i (where the exponent i is the loop index, not the imaginary unit).
You can compute this for all i iteratively. A complex multiply is four real multiplies plus two adds.
Knowing cos(n) doesn't help -- your math library already does these kind of trivial things for you.
Knowing that cos((i+1)y)=cos(iy+y)=cos(iy)cos(y)-sin(iy)sin(y) can help, if you precompute cos(y) and sin(y), and keep track of both cos(iy) and sin(i*y) along the way. It may result in some loss of precision, though - you'll have to check.
How accurate do you need the resulting cos(x) to be? If you can live with some error, you could create a lookup table, sampling the unit circle at 2*PI/N intervals, and then interpolate between two adjacent points. N would be chosen to achieve the desired level of accuracy.
What I don't know is whether an interpolation is actually less costly than computing a cosine. Since the cosine is usually computed in microcode on modern CPUs, it may not be.
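A minimal Python sketch of the lookup-table idea, assuming linear interpolation and a table of N = 1024 intervals (both are illustrative choices, as are the names build_table and cos_lookup). For linear interpolation of a cosine the error is bounded by (2*PI/N)^2 / 8, which is about 4.7e-6 for N = 1024.

```python
import math

def build_table(n):
    """Sample cos at n+1 points over [0, 2*pi]; the extra point avoids wrap-around logic."""
    return [math.cos(2.0 * math.pi * k / n) for k in range(n + 1)]

def cos_lookup(x, table):
    """Approximate cos(x) by linear interpolation between adjacent table entries."""
    n = len(table) - 1
    pos = (x % (2.0 * math.pi)) / (2.0 * math.pi) * n   # fractional table position
    k = min(int(pos), n - 1)                            # guard against pos == n
    frac = pos - k
    return table[k] + frac * (table[k + 1] - table[k])

table = build_table(1024)
max_err = max(abs(cos_lookup(0.001 * i, table) - math.cos(0.001 * i))
              for i in range(20000))
print(max_err < 1e-5)
```

Whether this beats the library cosine in speed depends on the platform, so it needs to be benchmarked rather than assumed.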
Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is times four, not times twelve as I originally posted. Each of the functions have the additional function-wrapping overhead, while in my initial post I just summed up the individual lines).
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    % This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    % This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
you could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple of years old) and only saw a speed difference of a factor of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping 2 elements is increasingly poor by comparison to using a temporary variable as the array grows.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M;
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end