As a first step into OpenMP, I set myself the challenge of parallelizing a matrix decomposition algorithm. I picked Crout with pivoting; the source can be found here:
http://www.mymathlib.com/c_source/matrices/linearsystems/crout_pivot.c
At the bottom of that decomposition function there's an outer for loop that advances i and p_row at the same time. OpenMP is as confused by this as I am and refuses to do anything with it.
After wrapping my mind around it I think I got it untangled into readable form:
p_row = p_k + n;
for (i = k+1; i < n; i++) {
    for (j = k+1; j < n; j++)
        *(p_row + j) -= *(p_row + k) * *(p_k + j);
    p_row += n;
}
At this point a serial run still produces the same result as the original code.
Then I add some pragmas, like this:
p_row = p_k + n;
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
for (i = k+1; i < n; i++) {
    for (j = k+1; j < n; j++)
        *(p_row + j) -= *(p_row + k) * *(p_k + j);
    #pragma omp critical
    p_row += n;
    #pragma omp flush(p_row)
}
Yet the results are essentially random.
What am I missing?
I haven't tested your adaptation of the original code, but your program has several problems.
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
The default behavior is for variables declared outside the parallel scope to be shared, so the shared clause is redundant.
More importantly, these variables should not all be shared; several of them must be handled differently:
n is unchanged during the iterations, so each thread had better have a local copy;
ditto for k and p_k;
p_row is modified, but you really want several copies of p_row, one per thread. That is what will ensure proper parallel processing, with each thread working on different rows. The problem is then to compute the value of p_row in the different threads.
In the outer loop, the first iteration (i = k+1) uses p_row, the next one uses p_row + n, and in general the iteration for a given i uses p_row + (i-(k+1))*n. These iterations will be spread over several threads. Assume each thread processes m consecutive iterations: thread 0 will process i = k+1 to i = k+m and rows p_row to p_row + (m-1)*n, thread 1 will process i = k+m+1 to i = k+2m and rows p_row + m*n to p_row + (2m-1)*n, and so on. Hence each thread can compute from i the value that p_row should have at the start of its chunk.
Here is a possible implementation:

p_row = p_k + n;
/* firstprivate ensures each thread starts from the initial values */
#pragma omp parallel for private(i, j) firstprivate(n, k, p_row, p_k)
for (i = k+1; i < n; i++) {
    double *row = p_row + (i - (k+1)) * n;  /* row pointer derived from i */
    for (j = k+1; j < n; j++)
        *(row + j) -= *(row + k) * *(p_k + j);
}

The row pointer is recomputed from i at the top of each iteration (OpenMP requires the loop header to stay in canonical form, so p_row cannot be incremented there). This continues to work in a sequential environment.
The critical section is useless (and was buggy in your previous code), and the explicit flush is unnecessary: there is an implicit flush at the end of a parallel region.
This is a piece of code from a program. The code sorts the array horses, whose size is n. How does the array gaps help in sorting the array horses?
int gaps[] = {701, 301, 132, 57, 23, 10, 4, 1};

for (k = 0; k < 8; k++)
    for (i = gaps[k]; i < n; ++i)
    {
        temp = horses[i];
        for (j = i; j >= gaps[k] && horses[j-gaps[k]] > temp; j -= gaps[k])
            horses[j] = horses[j-gaps[k]];
        horses[j] = temp;
    }
gaps[] is an experimentally derived gap sequence for Shellsort (Marcin Ciura, 2001): each pass performs an insertion sort over the elements lying gaps[k] positions apart, and the final pass with gap 1 is an ordinary insertion sort on an array that is by then nearly sorted.
Take a look at the last entry in the wiki table of gap sequences for Shellsort:
https://en.wikipedia.org/wiki/Shellsort#Gap_sequences
Wiki reference for this sequence:
https://oeis.org/A102549
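For a concrete feel, here is a minimal, self-contained sketch that wraps the loop from the question in a main() with made-up sample data (the array contents and its size are hypothetical). Note that every gap of at least n makes its pass a no-op, because the inner loop starts at i = gaps[k]:

#include <stdio.h>

int main(void)
{
    int horses[] = {12, 3, 7, 1, 9, 5, 11, 2};   /* hypothetical sample data */
    int n = sizeof horses / sizeof horses[0];
    int gaps[] = {701, 301, 132, 57, 23, 10, 4, 1};
    int i, j, k, temp;

    for (k = 0; k < 8; k++)                /* one pass per gap; gaps >= n are no-ops */
        for (i = gaps[k]; i < n; ++i)
        {
            temp = horses[i];
            /* insertion sort over the chain of elements gaps[k] apart */
            for (j = i; j >= gaps[k] && horses[j-gaps[k]] > temp; j -= gaps[k])
                horses[j] = horses[j-gaps[k]];
            horses[j] = temp;
        }

    for (i = 0; i < n; i++)
        printf("%d ", horses[i]);          /* prints: 1 2 3 5 7 9 11 12 */
    printf("\n");
    return 0;
}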
I have these two pieces of code, and the question is to find how many times x = x + 1 runs in each case, where T1(n) stands for code 1 and T2(n) for code 2. Then I have to find the big-O of each one. I know how to do that part; the thing I get stuck on is finding how many times (as a function of n, of course) x = x + 1 will run.
CODE 1:
for (i = 1; i <= n; i++)
{
    for (j = 1; j <= sqrt(i); j++)
    {
        for (k = 1; k <= n - j + 1; k++)
        {
            x = x + 1;
        }
    }
}
CODE 2:
for (j = 1; j <= n; j++)
{
    h = n;
    while (h > 0)
    {
        for (i = 1; i <= sqrt(n); i++)
        {
            x = x + 1;
        }
        h = h / 2;
    }
}
I am really stuck and have already read a lot, so I am asking if someone can help me; please explain it to me analytically.
PS: I think that in code 2 the loop for (i = 1; i <= sqrt(n); i++) will run n*log(n) times, right? Then what?
For code 1 you have that the number of calls of x = x + 1 is:

T1(n) = sum_{i=1..n} sum_{j=1..sqrt(i)} sum_{k=1..n-j+1} 1
      = sum_{i=1..n} sum_{j=1..sqrt(i)} (n - j + 1)
     <= sum_{i=1..n} sqrt(i) * n
     <= n * n*sqrt(n) = n^2 * sqrt(n)

so T1(n) = O(n^2 * sqrt(n)). Here we bounded 1 + sqrt(2) + ... + sqrt(n) by n*sqrt(n) and used the fact that n is the leading term of (n - j + 1).
For code 2 the calculations are simpler:

T2(n) = sum_{j=1..n} sum_{t=1..log(n)} sum_{i=1..sqrt(n)} 1 = n * log(n) * sqrt(n)

so T2(n) = O(n^(3/2) * log(n)). The second loop actually goes from h = n down to 0 by iterating h = h/2, but you can see that this is the same as counting t from 1 up to log(n). What we used is the fact that j, t, and i are mutually independent, so the triple sum factors into a product (analogously to how the sum from 1 to n of a term f(n) that does not depend on the summation index is just n*f(n)).
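If you want to sanity-check the bounds empirically, here is a small, self-contained sketch (the loop bounds are copied verbatim from the question) that counts the increments and divides by the predicted growth rates; both ratios flatten out toward constants as n grows:

#include <stdio.h>
#include <math.h>

/* Count the executions of x = x + 1 in code 1. */
long long t1(long long n)
{
    long long x = 0;
    for (long long i = 1; i <= n; i++)
        for (long long j = 1; j <= (long long)sqrt((double)i); j++)
            for (long long k = 1; k <= n - j + 1; k++)
                x++;
    return x;
}

/* Count the executions of x = x + 1 in code 2. */
long long t2(long long n)
{
    long long x = 0;
    for (long long j = 1; j <= n; j++)
        for (long long h = n; h > 0; h /= 2)
            for (long long i = 1; i <= (long long)sqrt((double)n); i++)
                x++;
    return x;
}

int main(void)
{
    for (long long n = 256; n <= 4096; n *= 4)
        printf("n=%5lld  T1/n^2.5 = %.3f  T2/(n^1.5*log2(n)) = %.3f\n",
               n,
               (double)t1(n) / pow((double)n, 2.5),
               (double)t2(n) / (pow((double)n, 1.5) * log2((double)n)));
    return 0;
}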
I have noticed that there is a performance penalty associated with using anonymous functions in Julia. To illustrate I have two implementations of quicksort (taken from the micro performance benchmarks in the Julia distribution). The first sorts in ascending order
function qsort!(a,lo,hi)
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i += 1; end
            while pivot < a[j]; j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end
The second takes an additional parameter: an anonymous function that can be used to specify ascending or descending sort, or comparison for more exotic types:
function qsort_generic!(a,lo,hi,op=(x,y)->x<y)
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while op(a[i], pivot); i += 1; end
            while op(pivot, a[j]); j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort_generic!(a,lo,j,op); end
        lo, j = i, hi
    end
    return a
end
There is a significant performance penalty when sorting Arrays of Int64, with the default version an order of magnitude faster. Here are times for sorting arrays of length N in seconds:
N      qsort_generic   qsort
2048   0.00125         0.00018
4096   0.00278         0.00029
8192   0.00615         0.00061
16384  0.01184         0.00119
32768  0.04482         0.00247
65536  0.07773         0.00490
The question is: Is this due to limitations in the compiler that will be ironed out in time, or is there an idiomatic way to pass functors/anonymous functions that should be used in cases like this?
Update: From the answers it looks like this is something that will be fixed in the compiler.
In the meantime, there were two suggested workarounds. Both approaches are fairly straightforward, though they do start to feel like the sort of jiggery-pokery that you have to use in C++ (though not on the same scale of awkwardness).
The first is the FastAnonymous package suggested by @Toivo Henningsson. I didn't try this approach, but it looks good.
I tried out the second method, suggested by @simonstar, which gave me performance equivalent to the non-generic qsort! implementation:
abstract OrderingOp
immutable AscendingOp <: OrderingOp end
immutable DescendingOp <: OrderingOp end

evaluate(::AscendingOp, x, y) = x < y
evaluate(::DescendingOp, x, y) = x > y

function qsort_generic!(a,lo,hi,op=AscendingOp())
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while evaluate(op, a[i], pivot); i += 1; end
            while evaluate(op, pivot, a[j]); j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort_generic!(a,lo,j,op); end
        lo, j = i, hi
    end
    return a
end
Thanks everyone for the help.
It's a problem and will be fixed with an upcoming type system overhaul.
Update: This has now been fixed in the 0.5 version of Julia.
As others have noted, the code you've written is idiomatic Julia and will someday be fast, but the compiler isn't quite there yet. Besides using FastAnonymous, another option is to pass types instead of anonymous functions. For this pattern, you define an immutable with no fields and a method (let's call it evaluate) that accepts an instance of the type and some arguments. Your sorting function would then accept an op object instead of a function and call evaluate(op, x, y) instead of op(x, y). Because functions are specialized on their input types, there is no runtime overhead to the abstraction. This is the basis for reductions and specification of sort order in the standard library, as well as NumericExtensions.
For example:
immutable AscendingSort; end
evaluate(::AscendingSort, x, y) = x < y

function qsort_generic!(a,lo,hi,op=AscendingSort())
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while evaluate(op, a[i], pivot); i += 1; end
            while evaluate(op, pivot, a[j]); j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort_generic!(a,lo,j,op); end
        lo, j = i, hi
    end
    return a
end
Yes, it's due to limitations in the compiler, and there are plans to fix it; see e.g. this issue. In the meantime, the FastAnonymous package might provide a workaround.
The way that you have done it looks pretty idiomatic, there's unfortunately no magic trick that you are missing (except for possibly the FastAnonymous package).
I am trying to find the complexity of this algorithm:
m = 0;
i = 1;
while (i <= n)
{
    i = i * 2;
    for (j = 1; j <= (long int)(log10(i)/log10(2)); j++)
        for (k = 1; k <= j; k++)
            m++;
}
I think it is O(log(n)*log(log(n))*log(log(n))):
The 'i' loop runs until i=log(n)
the 'j' loop runs until log(i) means log(log(n))
the 'k' loop runs until k=j --> k=log(i) --> k=log(log(n))
therefore O(log(n)*log(log(n))*log(log(n))).
The time complexity is Theta(log(n)^3).
Let T = floor(log_2(n)). Your code can be rewritten as:
int m = 0;
for (int i = 0; i <= T; i++)
    for (int j = 1; j <= i+1; j++)
        for (int k = 1; k <= j; k++)
            m++;
Which is obviously Theta(T^3).
Edit: Here's an intermediate step for rewriting your code. Let a = log_2(i). a is always an integer because i is a power of 2. Then your code is clearly equivalent to:
m = 0;
a = 0;
while (a <= log_2(n))
{
    a += 1;
    for (j = 1; j <= a; j++)
        for (k = 1; k <= j; k++)
            m++;
}
The other changes I made were naming floor(log_2(n)) T, renaming a to i, and using a for loop instead of a while loop.
Hope it's clear now.
Is this homework?
Some hints:
I'm not sure the code is doing what it should. log10 returns a floating-point value, and the cast to (long int) will probably cut off .9999999999; I don't think that is intended. The line should maybe look like this:
for (j=1;j<=(long int)(log10(i)/log10(2)+0.5);j++)
In that case you can rewrite this as:
m = 0;
for (i = 1, a = 1; i <= n; i = i*2, a++)
    for (j = 1; j <= a; j++)
        for (k = 1; k <= j; k++)
            m++;
Therefore your complexity assumption for the 'j' and 'k' loops is wrong: the outer loop does run log(n) times, but i increases all the way up to n, not up to log(n), so the 'j' and 'k' bounds grow to log(n) rather than log(log(n)).
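To see the Theta(log(n)^3) behavior numerically, here is a small, self-contained sketch that sidesteps the floating-point issue by tracking log_2(i) in an integer counter a (as in the rewrite above) and compares m against T^3 with T = log_2(n). The ratio settles toward a constant (1/6 in the limit, since m = (T+1)(T+2)(T+3)/6 when n is a power of two):

#include <stdio.h>

int main(void)
{
    for (long n = 1L << 8; n <= 1L << 24; n <<= 4) {   /* powers of two only */
        long m = 0, i = 1, a = 0, T = 0;
        while ((1L << (T + 1)) <= n)
            T++;                    /* T = log_2(n) exactly, no floating point */
        while (i <= n) {
            i = i * 2;
            a += 1;                 /* a == log_2(i), replaces the log10 cast */
            for (long j = 1; j <= a; j++)
                for (long k = 1; k <= j; k++)
                    m++;
        }
        printf("n = %10ld   m = %8ld   m/T^3 = %.3f\n",
               n, m, (double)m / ((double)T * T * T));
    }
    return 0;
}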