cpumemory.pdf - cache optimized matrix multiplication - caching

I'm reading cpumemory.pdf by Ulrich Drepper and I'm unable to understand the following part about optimizing
cache access in matrix multiplication from chapter 6.2.1 (pages 49-50):
First, the naive method for matrix multiplication is shown:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];
mul2 is accessed by columns, so each access pulls in a whole cache line but uses only one element of it. Ulrich says:
With sizeof(double) being 8 this means that, to fully utilize the cache line,
we should unroll the middle loop 8 times.
For brevity I unrolled the middle loop only 2 times.
for (i = 0; i < N; ++i)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i][j+0] += mul1[i][k] * mul2[k][j+0];
            res[i][j+1] += mul1[i][k] * mul2[k][j+1];
        }
Now it's obvious that if a cache line is 2 double values wide, it'll be fully
utilized. But then Ulrich continues:
Continuing this thought, to effectively use the res matrix as well, i.e., to
write 8 results at the same time, we should unroll the outer loop 8 times as
well.
For brevity I unrolled the outer loop only 2 times again.
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
            res[i+0][j+1] += mul1[i+0][k] * mul2[k][j+1];
            res[i+1][j+0] += mul1[i+1][k] * mul2[k][j+0];
            res[i+1][j+1] += mul1[i+1][k] * mul2[k][j+1];
        }
To me it seems even worse than the previous version, because now mul1 is accessed
by columns. Please explain what Ulrich meant.

There are three matrices inside the cache: the left input, the right input and the result.
The left input is accessed just fine by the original code: it is row-major and the innermost loop increments k, so it marches along a cache line. The right matrix is handled by the single unrolling, because now all the columns in one of its cache lines are used before the line is evicted.
The question is the result matrix. It is also row-major, but its cache line is indexed by j, not by k, and you are right that j has already been unrolled, so all the elements on a result cache line are already being used. There doesn't appear to be anything gained in line coverage by the second unrolling: all it does is touch two extra cache lines per block, one more for the left matrix and one more for the result, without improving how completely any line is used.
However, the second unrolling does reuse the right matrix's cache line twice, which cuts the total number of times the lines of the right matrix have to be brought in, and it does not increase the number of fetches for the other two matrices. So that reuse of the entire line is presumably where the advantage comes from. The remaining question is whether the loop is properly blocked to the cache size, and what the set associativity of the cache is: if all the lines of all three matrices stay in the cache anyway, the second unrolling gains nothing (but it doesn't make anything worse either!)
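To see the reuse concretely, here is a sketch of the same 2x2 unrolling with the two mul2 values from one cache line loaded into temporaries once per k and used for both rows of res. This is only an illustration of the reuse argument above; Drepper's actual code unrolls by the full cache-line width and combines it with blocking.
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            double m0 = mul2[k][j+0];  /* one cache line of mul2 ... */
            double m1 = mul2[k][j+1];  /* ... fetched once per k     */
            res[i+0][j+0] += mul1[i+0][k] * m0;  /* reused for row i   */
            res[i+0][j+1] += mul1[i+0][k] * m1;
            res[i+1][j+0] += mul1[i+1][k] * m0;  /* reused for row i+1 */
            res[i+1][j+1] += mul1[i+1][k] * m1;
        }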

Related

How to determine the number of steps on a nested for loop?

I've been working on this algorithm question for a while now and haven't made much progress. The given code is
for(i = n; i > 0; i = i-4){
for(j = i; j < n; j++){
...
}
}
The goal is to determine the theta-runtime of the nested loop. My main issue came when trying to write out the values of i at each iteration. I figured on the first iteration i = n, then on the second iteration it will be i = n-4, and then i = n-8 and so on, but I get confused in determining what the last value of i should be (and more importantly, what the last iteration of the outer loop would be). I spoke with a friend who suggested that the total number of outer-loop iterations should be the ceiling of n/4, which seems to make sense, but I don't know how to verify that. Does anyone have any idea how to approach this kind of problem?
The loop as written decreases the value of 'i' by 4 on every iteration (i = i-4), which matches your description of i = n, n-4, n-8, and so on.
"total number of outerloop iterations should be the ceiling of n/4" - it's true. As you're decreasing the value of 'i' by 4.
Think about it like this, when you're decreasing the value by 1, it'll run for 'n' times. When you'll decrease by 2, it'll run for n / 2 times. And so on...
To verify the answer, you can keep a counter in your code, then print the value of the count at the end of the program.
About determining the last of 'i':
The last value of 'i' will be from 1 to 4 for which your code will get into the inner loop. (because you're running the loop as long as i > 0 ).
int outerLoopCount = 0;
for (i = n; i > 0; i = i-4)
{
    outerLoopCount++;
    for (j = i; j < n; j++)
    {
    }
}
printf("outer loop count: %d\n", outerLoopCount);

Calculating Big "O" for the following example

Let's say I have the following code sample:
int number;
for(int i = 0; i < A; i++)
    for(int j = 0; j < B; j++)
        if(i == j) // some condition...
            do{
                number = rand();
            }while(number > 100);
I would like to know the Big "O" for this example. The outer loops are O(A * B), but I'm not sure what to think about the do-while loop and its Big "O". In the worst-case scenario it can loop forever, and in the best case it's O(1) and can be ignored.
Edit: updated condition inside the if statement (replaced function call with a simple comparison).
Since rand() returns values in a fixed range (0 to RAND_MAX), the expected number of do-while iterations is a constant that does not depend on A or B: the loop stops as soon as a value <= 100 appears, so on average it runs about (RAND_MAX + 1) / 101 times. So the do-while statement can be treated as O(1) in expectation (a single run can in principle loop for a long time, but how long does not depend on A or B).
With the i == j comparison also being O(1), the total complexity is O(A * B).
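As a rough empirical check, you can measure the average number of do-while iterations directly; it depends only on RAND_MAX, not on A or B. The sample average over a handful of runs is noisy, and on platforms with a large RAND_MAX this takes a second or two:
#include <stdio.h>
#include <stdlib.h>

/* Rough check: the do-while's average iteration count is about
   (RAND_MAX + 1) / 101, a constant independent of A and B. */
int main(void) {
    const int trials = 10;          /* small trial count, so the estimate is noisy */
    long long total = 0;
    srand(42);
    for (int t = 0; t < trials; t++) {
        int number;
        do {
            number = rand();
            total++;
        } while (number > 100);
    }
    printf("average iterations per do-while: %lld (expected ~%lld)\n",
           total / trials, ((long long)RAND_MAX + 1) / 101);
    return 0;
}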

How to openmp parallelize for loop that increments two variables

As a first step into OpenMP I set myself a challenge to parallelize a matrix decomposition algorithm. I picked Crout with pivoting; the source can be found here:
http://www.mymathlib.com/c_source/matrices/linearsystems/crout_pivot.c
At the bottom of that decomposition function there's an outer for loop that walks over i and p_row at the same time. Of course OpenMP is as confused as I am when looking at this and refuses to do anything with it.
After wrapping my mind around it I think I got it untangled into readable form:
p_row = p_k + n;
for (i = k+1; i < n; i++) {
for (j = k+1; j < n; j++) *(p_row + j) -= *(p_row + k) * *(p_k + j);
p_row += n;
}
At this point serial run still comes up with the same result as the original code.
Then I add some pragmas, like this:
p_row = p_k + n;
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
for (i = k+1; i < n; i++) {
for (j = k+1; j < n; j++) *(p_row + j) -= *(p_row + k) * *(p_k + j);
#pragma omp critical
p_row += n;
#pragma omp flush(p_row)
}
Yet the results are essentially random.
What am I missing?
I haven't tested your adaptation of the original code, but your program has several problems.
#pragma omp parallel for private (i,j) shared (n,k,p_row,p_k)
The default behavior is for variables declared outside the construct to be shared, so the shared clause is redundant.
But these variables should not all be shared; they should be made private (or firstprivate).
n is unchanged during the iterations, so it is better for each thread to have a local copy;
the same goes for k and p_k.
p_row is modified, but you really want several copies of p_row: that is what ensures proper parallel processing, so that each thread works on different rows. The problem is how to compute the right p_row value in the different threads.
In the outer loop, the first iteration uses p_row, the second p_row + n, and in general iteration l uses p_row + l*n. The iterations will be spread over several threads; if each thread processes m of them, thread 0 handles i = k+1 to k+m and rows p_row through p_row + (m-1)*n, thread 1 handles i = k+m+1 to k+2m and rows p_row + m*n through p_row + (2m-1)*n, and so on. Hence each thread can compute the p_row value it needs directly from i.
Here is a possible implementation
p_row = p_k + n;
#pragma omp parallel for private(i, j) firstprivate(n, k, p_row, p_k)
for (i = k+1; i < n; i++) {
    /* firstprivate ensures each thread starts from the initial values;
       the row pointer is derived from i, not incremented across iterations */
    double *row = p_row + (i - (k+1)) * n;
    for (j = k+1; j < n; j++)
        *(row + j) -= *(row + k) * *(p_k + j);
}
Each thread derives its own row pointer from i inside the loop body, so nothing is carried over from one iteration to the next, and the code still works when compiled without OpenMP.
The critical section was useless (and buggy in your previous code), and the explicit flush is unnecessary: a flush is implied at the end of the parallel region.
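If you want to convince yourself that the parallel version is deterministic, here is a self-contained sanity check (purely illustrative; the matrix size N = 64 and the pivot index k = 3 are arbitrary choices) that runs the row-update step once serially and once with the OpenMP pragma and compares the results. Compile with something like gcc -fopenmp -O2 check.c -lm.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 64   /* arbitrary matrix size for the check */

/* Row-update step for a fixed k, serial version (same shape as the untangled loop). */
static void update_serial(double *a, int k) {
    double *p_k = a + k * N;
    double *p_row = p_k + N;
    for (int i = k + 1; i < N; i++, p_row += N)
        for (int j = k + 1; j < N; j++)
            *(p_row + j) -= *(p_row + k) * *(p_k + j);
}

/* Same step with the OpenMP pragma; each thread derives its row pointer from i. */
static void update_parallel(double *a, int k) {
    double *p_k = a + k * N;
    double *p_row = p_k + N;
    int i, j, n = N;
    #pragma omp parallel for private(i, j) firstprivate(n, k, p_row, p_k)
    for (i = k + 1; i < n; i++) {
        double *row = p_row + (i - (k + 1)) * n;
        for (j = k + 1; j < n; j++)
            *(row + j) -= *(row + k) * *(p_k + j);
    }
}

int main(void) {
    static double a[N * N], b[N * N];
    srand(1);
    for (int i = 0; i < N * N; i++)
        a[i] = b[i] = (double)rand() / RAND_MAX;
    update_serial(a, 3);     /* k = 3 is an arbitrary pivot index */
    update_parallel(b, 3);
    double max_diff = 0.0;
    for (int i = 0; i < N * N; i++)
        max_diff = fmax(max_diff, fabs(a[i] - b[i]));
    printf("max difference: %g\n", max_diff);   /* should print 0 */
    return 0;
}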

openmp parallelized code in c++ is slower than the serial one

I would like to parallelise the following code
for (i=0; i<M; i++){
temp = 0.
for (j=0; j<N; j++){
temp += A[i][j]*v[j];
u[i] = temp;
}
where different u[i] can be computed independently. So I tried to do
#pragma omp parallel for private (j,temp)
for (i=0; i<M; i++){
    temp = 0.;
    for (j=0; j<N; j++){
        temp += A[i][j]*v[j];
        u[i] = temp;
    }
}
and I find that the second version is slower than the first one. Any idea why that is? Here M ~ 100 and N ~ 2.
There is no benefit to parallelizing this because the data are so small: the overhead of setting up and coordinating a parallel computation outweighs any benefit of parallel execution.
Further, the calculation is easily optimized.
Your compiler will most certainly unroll the inner loop (where N=2), so that it becomes:
u[i] = A[i][0]*v[0] + A[i][1]*v[1];
If we assume A's elements are a 64-bit type (like double), the outer loop over i can be unrolled 4 times, so that each group of accesses to A touches a single 64-byte cache line (the 8 entries A[i][0] through A[i+3][1]).
The result is a single loop with only 25 iterations (for M=100):
for (i=0; i<M; i+= 4)
{
u[i] = A[i][0]*v[0] + A[i][1] * v[1];
u[i+1] = A[i+1][0]*v[0] + A[i+1][1] * v[1];
u[i+2] = A[i+2][0]*v[0] + A[i+2][1] * v[1];
u[i+3] = A[i+3][0]*v[0] + A[i+3][1] * v[1];
}
With prefetching, all accesses are basically already in cache.
You would need much bigger data to see any benefit from parallelizing this :)
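If you want to keep the pragma around for the day the data does get big, OpenMP's if clause lets you fall back to serial execution when there is too little work. Below is a small self-contained sketch; the M * N > 10000 threshold is an illustrative guess, not a measured value, so tune it by timing on your machine.
#include <stdio.h>
#include <stdlib.h>

#define M 100
#define N 2

int main(void) {
    static double A[M][N], v[N], u[M];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (double)rand() / RAND_MAX;
    for (int j = 0; j < N; j++)
        v[j] = (double)rand() / RAND_MAX;

    /* The if clause disables the parallel region when there is too little work
       to amortize the threading overhead; 10000 is just an illustrative cutoff. */
    #pragma omp parallel for if(M * N > 10000)
    for (int i = 0; i < M; i++) {
        double temp = 0.;
        for (int j = 0; j < N; j++)
            temp += A[i][j] * v[j];
        u[i] = temp;
    }

    printf("u[0] = %f\n", u[0]);
    return 0;
}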

On understanding how to compute the Big-O of code snippets

I understand that simple statements like:
int x = 5; // is 1 or O(1)
And a while loop such as:
while(i<); // is n+1 or O(n)
And same with a for a single for loop (depending).
With nested while or for loop such as:
for(int i = 0; i<n; i++){ // this is n + 1
for(int j = 0; j<n; j++){ // this is (n+1)*n, total = O(n^2)
}
Also, any time we have a doubling effect it's log_2(n), a tripling effect is log_3(n), and so on. And if the control variable is being halved or quartered, that's also either log_2(n) or log_4(n).
But I am dealing with much more complicated examples. How would one figure these examples out? I have the answers; I just don't know how to work them out on paper come exam time.
Example1:
for (i = 1; i < (n*n+3*n+17) / 4 ; i += 1)
System.out.println("Sunshine");
Example2:
for (i = 0; i < n; i++)
if ( i % 2 == 0) // very confused by what mod would do to runtime
for (j = 0; j < n; j++)
System.out.print("Bacon");
else
for (j = 0; j < n * n; j++)
System.out.println("Ocean");
Example3:
for (i = 1; i <= 10000 * n: i *= 2)
x += 1;
Thank you
Example 1 is bounded by the expression (n*n + 3*n + 17) / 4 and is therefore O(n^2): the largest, and therefore dominant, term in that expression is n^2.
The second example is a bit trickier. The outer loop over i iterates n times, but what executes inside depends on whether that value of i is odd or even. When it is even, a loop over n runs; when it is odd, a loop over n^2 runs. The odd case dominates the running time, so example 2 is O(n^3).
The third example iterates until i exceeds 10000*n, but does so by doubling the loop counter i at each step, so it runs in O(lg n) time, where lg is the log base 2. To see why, imagine we wanted to reach 32 starting from i = 1 and doubling each time: we would see 1, 2, 4, 8, 16, 32, i.e. lg(32) + 1 = 6 steps, and that count grows as the log of the bound. The constant factor 10000 only adds the constant lg(10000) to the step count, so it does not change the asymptotic bound.
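For example 2 you can also make the even/odd argument concrete by summing the work: about n/2 even values of i each contribute n inner iterations and about n/2 odd values each contribute n^2, for a total of roughly n^2/2 + n^3/2 = Theta(n^3). A quick counting sketch in C (the values of n are just illustrative) confirms the growth:
#include <stdio.h>

/* Illustrative check of Example 2: count the inner iterations and compare them
   against n^3/2, which is what the even/odd analysis predicts. */
int main(void) {
    for (long n = 100; n <= 400; n *= 2) {
        long count = 0;
        for (long i = 0; i < n; i++) {
            if (i % 2 == 0)
                for (long j = 0; j < n; j++) count++;      /* even i: n work   */
            else
                for (long j = 0; j < n * n; j++) count++;  /* odd i: n^2 work  */
        }
        printf("n=%ld count=%ld n^3/2=%ld\n", n, count, n * n * n / 2);
    }
    return 0;
}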
