In the book "Using OpenMP" there is an example of bad memory access in C, and I think this is the main problem in my attempt to parallelize the Gaussian elimination algorithm.
The example looks something like this:
k = 0;
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        a[i][j] = a[i][j] - a[i][k]*a[k][j];
So I do understand why this causes bad memory access: in C a 2D array is stored by rows, and here every step of the i loop jumps to a new row, so a new cache line has to be loaded from memory on each iteration.
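To convince myself of the row-major layout I used a small standalone check (my own snippet, not from the book): the distance from a[i][j] to a[i][j+1] is one element, while the distance to a[i+1][j] is a whole row.
#include <stdio.h>

int main(void)
{
    double a[4][4];
    /* byte distances inside the same array object */
    printf("next column: %td bytes away\n", (char *)&a[0][1] - (char *)&a[0][0]);  /* 8  */
    printf("next row:    %td bytes away\n", (char *)&a[1][0] - (char *)&a[0][0]);  /* 32 */
    return 0;
}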
I am trying to find a solution for this, but I'm not getting a good speedup; the effects of my attempts are minor.
Can someone give me a hint about what I could do?
The easiest way would be to swap the for loops, but I want to keep working column-wise.
The second attempt:
for (int j = 0; j < n-1; j += 2)
    for (int i = 0; i < n; i++)
    {
        a[i][j]   = a[i][j]   - a[i][k]*a[k][j];
        a[i][j+1] = a[i][j+1] - a[i][k]*a[k][j+1];
    }
This didn't make a difference at all.
The third attempt:
for (int j = 0; j < n; j++)
{
    double d = a[k][j];
    for (int i = 0; i < n; i++)
    {
        double e = a[i][k];
        a[i][j] = a[i][j] - e*d;
    }
}
Thanks a lot.
Greetings, Stepp
Use a flat array instead, e.g.:
#define A(i,j) A[i + j*ldA]
for (int j = 0; j < n; j++)
{
    d = A(k,j);
    ...
}
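For what it's worth, here is one possible reading of that suggestion as a complete update loop (my sketch, not the answerer's exact code; it assumes A is a flat buffer of at least ldA*n doubles with ldA >= n, so the layout is column-major and the inner i loop is stride-1):
for (int j = 0; j < n; j++)
{
    double d = A(k,j);                 /* hoisted, as in the third attempt above */
    for (int i = 0; i < n; i++)
        A(i,j) = A(i,j) - A(i,k)*d;    /* both A(i,j) and A(i,k) are stride-1 in i */
}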
Your loop order will cause a cache miss on every iteration, as you point out. So just swap the order of the loop statements:
for (int i = 0; i < n; i++)         // now "i" is first
    for (int j = 0; j < n; j++)
        a[i][j] = a[i][j] - a[i][k]*a[k][j];
This will fix the row in a and vary just the columns, which means your memory accesses will be contiguous.
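If the goal is still to parallelize the Gaussian elimination update with OpenMP, a minimal sketch of the swapped-loop version could look like this (assuming k, n and a are the elimination variables from the question, and the usual elimination bounds i, j > k, so that row k and column k are only read during this step and each thread writes its own rows):
#pragma omp parallel for
for (int i = k + 1; i < n; i++)          /* rows are independent between threads */
    for (int j = k + 1; j < n; j++)      /* contiguous accesses along row i */
        a[i][j] -= a[i][k] * a[k][j];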
This memory access problem is related to cache usage, not to OpenMP.
To make good use of the cache you should in general access contiguous memory locations. Remember also that if two or more threads write to different data that happen to share a cache line, you can run into a "false sharing" problem that forces the cache line to be reloaded unnecessarily.
See for example:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/
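As a concrete illustration of avoiding that problem (my own minimal sketch with made-up names, not code from the linked article): instead of letting every thread update adjacent slots of a shared array, accumulate into a private variable and combine once at the end.
#include <omp.h>

double sum_array(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;            /* private: no cache line shared between threads */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += x[i];
        #pragma omp atomic
        total += local;                /* one contended update per thread */
    }
    return total;
}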
Related
I'm trying to write code for matrix multiplication. As far as I understand OMP and parallel programming, this code may suffer from a race condition.
#pragma omp parallel
#pragma omp for
for (int k = 0; k < size; k++) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
Do I get rid of it if I put #pragma omp atomic before the write to the c matrix, or by adding private(i) to the second #pragma? Also, is it possible to make this code false-sharing free? If yes, how?
A race condition occurs when two or more threads access the same memory location and at least one of them writes to it. The line c[i][j] += ... can cause a data race in your code. The solution is to reorder your nested loops (use the order i, j, k), and you may introduce a temporary variable to calculate the dot product:
#pragma omp parallel for
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        double tmp = 0; // change its type as needed
        for (int k = 0; k < size; k++) {
            tmp += a[i][k] * b[k][j];
        }
        c[i][j] = tmp; // note that += was used in your original code
    }
}
Note that your code will be faster if you calculate the transpose of matrix b. For more details read this.
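For illustration, a hedged sketch of the transpose idea (bt is an extra buffer I introduce here, assumed declared with the same dimensions as b; it is not part of the original code): with bt[j][k] == b[k][j], both operands of the dot product are read with stride 1.
/* build the transpose once */
for (int k = 0; k < size; k++)
    for (int j = 0; j < size; j++)
        bt[j][k] = b[k][j];

#pragma omp parallel for
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++) {
        double tmp = 0;
        for (int k = 0; k < size; k++)
            tmp += a[i][k] * bt[j][k];   /* contiguous in both a and bt */
        c[i][j] = tmp;
    }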
UPDATE:
If you need to maintain the order of the loops, there are two possibilities (but these solutions may be slower than the serial code):
Use an atomic operation (i.e. #pragma omp atomic). In this case false sharing can also be a problem.
If your stack is large enough to store the matrix for all threads, a better alternative is to use reduction: #pragma omp parallel for reduction(+:c[:size][:size]) (Another alternative is to do the reduction manually. In this case you can allocate the matrices used for reduction on the heap.)
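Put together, the reduction variant might look like this (a sketch only; the array-section reduction syntax requires OpenMP 4.5 or later, and c is assumed to be zero-initialized before the loop):
#pragma omp parallel for reduction(+:c[:size][:size])
for (int k = 0; k < size; k++)
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            c[i][j] += a[i][k] * b[k][j];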
MaximumPairwiseProduct
for (int i = 0; i < n; i++) {
    for (int j = i + 1; j < n; j++) {
    }
}
So I was wondering: what will be the value of j at the first step when i = n-1? Thank you.
When i reaches the value n-1, j doesn't exist yet. The reason is that j is not in the current scope; it is only created when int j = i+1 is executed. At that point j starts at n, so the condition j < n is false immediately and the inner loop body never runs.
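A tiny check of that behaviour (my own example, not the asker's code): with n = 3 the inner loop runs 2, 1, and finally 0 times.
#include <stdio.h>

int main(void)
{
    int n = 3;
    for (int i = 0; i < n; i++) {
        int inner = 0;
        for (int j = i + 1; j < n; j++)
            inner++;
        printf("i = %d: inner loop ran %d time(s)\n", i, inner);
    }
    return 0;   /* prints 2, 1, 0 */
}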
I have been trying to get a better awareness of cache locality. I produced the two code snippets below to gain a better understanding of the cache locality characteristics of each.
vector<int> v1(1000, some random numbers);
vector<int> v2(1000, some random numbers);
for (int i = 0; i < v1.size(); ++i)
{
    auto min = INT_MAX;
    for (int j = 0; j < v2.size(); ++j)
    {
        if (v2[j] < min)
        {
            v1[i] = j;
        }
    }
}
vs
vector<int> v1(1000, some random numbers);
vector<int> v2(1000, some random numbers);
for (int i = 0; i < v1.size(); ++i)
{
    auto min = INT_MAX;
    auto loc = -1;
    for (int j = 0; j < v2.size(); ++j)
    {
        if (v2[j] < min)
        {
            loc = j;
        }
    }
    v1[i] = loc;
}
In the first snippet v1 is updated directly inside the if statement. Does this cause cache locality issues, because each update pulls a cache line's worth of v1 (starting at v1[i]) into the cache? If this analysis is correct, it would seem to slow down the loop over j, since each iteration of the j loop would then be more likely to take cache misses. The second implementation seems to alleviate this issue by using a local variable and not updating v1[i] until after the j loop is complete.
In practice, I think the compiler might optimize the cache locality issue away. For this discussion, let's assume no compiler optimizations.
I am trying to learn Big-O notation. While searching for some articles online, I came across two different articles, A and B.
Strictly in terms of loops, they seem to have almost the same kind of flow.
For example
[A]'s code is as follows (it's done in JS):
function allPairs(arr) {
    var pairs = [];
    for (var i = 0; i < arr.length; i++) {
        for (var j = i + 1; j < arr.length; j++) {
            pairs.push([arr[i], arr[j]]);
        }
    }
    return pairs;
}
[B]'s code is as follows (it's done in C); the entire code is here:
for (int i = 0; i < n-1; i++) {
    char min = A[i];      // minimal element seen so far
    int min_pos = i;      // memorize its position
    // search for min starting from position i+1
    for (int j = i + 1; j < n; j++)
        if (A[j] < min) {
            min = A[j];
            min_pos = j;
        }
    // swap elements at positions i and min_pos
    A[min_pos] = A[i];
    A[i] = min;
}
The article on site A mentions that the time complexity is O(n^2), while the article on site B mentions that it is O(1/2·n^2).
Which one is right?
Thanks
Assuming that O(1/2·n2) means O(1/2·n^2), the two time complexities are equal. Remember that Big-O notation does not care about constant factors, so both algorithms are O(n^2).
You didn't read carefully. Article B says that the algorithm performs about N²/2 comparisons and goes on to explain that this is O(N²).
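To see where the 1/2 comes from (a quick standalone sketch, not taken from either article): for each i the inner loop runs n-1-i times, so the total is (n-1) + (n-2) + ... + 1 = n(n-1)/2, which is roughly n^2/2.
#include <stdio.h>

int main(void)
{
    int n = 10;
    long count = 0;
    for (int i = 0; i < n - 1; i++)       /* same loop shape as article B */
        for (int j = i + 1; j < n; j++)
            count++;
    printf("%ld vs n*(n-1)/2 = %d\n", count, n * (n - 1) / 2);   /* both print 45 */
    return 0;
}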
I found Intel's performance suggestion for Xeon Phi about the collapse clause in OpenMP. The original example:
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[j + jmax*i] = 1.;
}
Modified example for better performance:
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[k++] = 1.;
}
I tested both cases in Fortran with similar code on a regular CPU using GFortran 4.8, and they both give the correct result. A test with similar Fortran code for the latter version does not pass with GFortran 5.2.0 or Intel 14.0.
But as far as I understand, an OpenMP loop body should avoid loop-sequence-dependent variables, which in this case is k. So why does the latter case give the correct result and even better performance?
Here's the equivalent code for the two approaches when using the collapse clause. You can see that the second one is better.
for (int k = 0; k < imax*jmax; k++) {
    int i = k / jmax;
    int j = k % jmax;
    a[j + jmax*i] = 1.;
}
for (int k = 0; k < imax*jmax; k++) {
    a[k] = 1.;
}