Locality of function - caching

Does this function have good locality with respect to array a? Justify your answer by calculating the average miss and hit rates if the array size is 10 times larger than the cache.
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Probably not.
In the inner loop you increment i, which results in accessing memory locations separated by N*sizeof(int) bytes. The ratio of cache misses will depend on the constant N and the size of the cache. The array size (understood as sizeof a) will probably have no impact on cache effectiveness.
Moreover, CPUs often do speculative prefetching of memory by tracking the memory access pattern of your program. Therefore there might be few cache misses even though only a small portion of the cache is actually used. An exact answer would require benchmarking on a specific architecture.
To make things more 'interesting', modern compilers will often reorder the loops and improve locality automatically. Therefore I think there is not enough information to give a reliable answer to your question.
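
For contrast, here is a minimal sketch of the row-wise traversal with good locality (assuming, as in the question, that M and N are compile-time constants): the inner loop walks consecutive ints, so successive accesses stay within the same cache lines.

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)        /* visit one row at a time              */
        for (j = 0; j < N; j++)    /* consecutive ints, stride sizeof(int) */
            sum += a[i][j];
    return sum;
}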

Related

What counts as an operation in algorithms?

So I've just started learning algorithms and data structures, and I've read about Big O and how it portrays the complexity of algorithms based on how the number of operations required scales.
But what actually counts as an operation? In this bubble sort, does each iteration of the for loop count as an operation, or only when the if statement is triggered, or all of them?
And since there are so many different algorithms of all kinds, how do you immediately identify what counts as an "operation" in an algorithm's code?
function bubbleSort(array) {
    for (let i = 0; i < array.length; i++) {
        for (let j = 0; j < array.length; j++) {
            if (array[j + 1] < array[j]) {
                let tmp = array[j]
                array[j] = array[j + 1]
                array[j + 1] = tmp
            }
        }
    }
    return array
}
You can count as an operation anything that executes within a constant amount of time, independent of the input. In other words, operations that have constant time complexity.
If we assume your input consists of fixed-size integers (like 32-bit or 64-bit), then all of the following can be considered such elementary operations:
i++
j < array.length
array[j + 1] < array[j]
let tmp = array[j]
...
But that also means you can take several such operations together and still consider them one elementary operation. So this is also an elementary operation:
if (array[j + 1] < array[j]) {
    let tmp = array[j]
    array[j] = array[j + 1]
    array[j + 1] = tmp
}
So, don't concentrate on breaking down operations into smaller operations, and those again into even smaller operations, when you are already certain that the larger operation is O(1).
Usually, any constant-time piece of work counts as a single operation. This is one of the reasons we don't count the exact number of operations, but instead use asymptotic notation (big O and big Theta).
However, sometimes you are interested in only one kind of operation. A common example is algorithms that use IO. Since IO is significantly more time-consuming than anything happening on the CPU, you often just "count" the number of IO operations instead. In these cases, you often care about the exact number of times an IO operation occurs, and can't rely on asymptotic notation alone.
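
As a concrete illustration (my own sketch, not from the original answers), here is an instrumented bubble sort in C with a hypothetical comparisons counter. Whether you count each comparison as one operation or the whole if-block as one operation, the total only changes by a constant factor and stays Theta(n^2).

#include <stddef.h>

/* Hypothetical instrumented bubble sort: counts one unit of work per
 * inner-loop iteration. Counting the entire if-block as the "operation"
 * instead would change the total by at most a constant factor. */
size_t bubble_sort_count(int *array, size_t n)
{
    size_t comparisons = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j + 1 < n; j++) {
            comparisons++;                   /* one O(1) operation          */
            if (array[j + 1] < array[j]) {   /* the swap below is also O(1) */
                int tmp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = tmp;
            }
        }
    }
    return comparisons;   /* n * (n - 1) comparisons, i.e. Theta(n^2) */
}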

Do variables declared in loop make space complexity O(N)?

Would variables declared inside of a for loop that loops N times make the space complexity O(N) even though those variables fall out of scope each time the loop repeats?
for (var i = 0; i < N; i++) {
    var num = i + 5;
}
Would variables declared inside an O(N) for loop make the space complexity O(N)?
No. Since the variables go out of scope at the end of every iteration, they are destroyed.
As a result the space complexity remains constant, i.e. O(1).
One (fixed-size) variable that you change n times (which could include deallocating and reallocating it) is still just one variable, thus O(1) space.
But this may be somewhat language-dependent: if some language (or compiler) decides to keep all of those earlier instances of the variable in memory, that's going to be O(n), not O(1).
Consider, for example, two ways of doing this in C++:
for (int i = 0; i < N; i++) {
    int num = i + 5;
}

for (int i = 0; i < N; i++) {
    int* num = new int(i + 5);
}
In the former case, the variable can be reused and it will be O(1).
When you use new, that memory will not be automatically freed, so every iteration in the latter case will allocate more memory instead of reusing the old (technically the pointer will get reused, but what it pointed to will remain), so it will use O(n) space. Doing this is a terrible idea and a memory leak, but it's certainly possible.
(I'm not too sure what the C++ standard says about what compilers are or are not required to do in each case; this is mostly just meant to show that this type of in-loop allocation is not necessarily always O(1).)
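
For what it's worth, here is a C analog of the same contrast (my own sketch, not part of the original answer), using malloc instead of new:

#include <stdlib.h>

void space_demo(int N)
{
    /* O(1) extra space: the same stack slot is reused on every iteration. */
    for (int i = 0; i < N; i++) {
        int num = i + 5;
        (void)num;                       /* silence unused-variable warnings */
    }

    /* O(N) extra space: each iteration allocates a fresh block and the
     * pointer to the previous one is lost, so nothing is ever freed.
     * This is a deliberate leak, mirroring the new int(...) example above. */
    for (int i = 0; i < N; i++) {
        int *num = malloc(sizeof *num);
        if (num)
            *num = i + 5;
    }
}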
No, it remains O(1) as explained below:
for (var i = 0; i < N; i++) {
    var num = i + 5; // allocate space for var `num`
} // release space acquired by `num`

Space complexity of nested loop

I am confused when it comes to the space complexity of an algorithm. In theory, it corresponds to the extra stack space that an algorithm uses, i.e. space other than the input. However, I have trouble pinpointing what exactly is meant by that.
If, for instance, I have the following brute force algorithm that checks whether there are no duplicates in the array, would that mean it uses O(1) extra storage space, because it uses int j and int k?
public static void distinctBruteForce(int[] myArray) {
    for (int j = 0; j < myArray.length; j++) {
        for (int k = j + 1; k < myArray.length; k++) {
            if (k != j && myArray[k] == myArray[j]) {
                return;
            }
        }
    }
}
Yes, according to your definition (which is correct), your algorithm uses constant, or O(1), auxiliary space: the loop indices, possibly some constant amount of space needed to set up the function call itself, etc.
It is true that one could argue the loop indices take space logarithmic in the size of the input (measured in bits), but this is usually approximated as constant.
According to the Wikipedia entry:
In computational complexity theory, DSPACE or SPACE is the computational resource describing the resource of memory space for a deterministic Turing machine. It represents the total amount of memory space that a "normal" physical computer would need to solve a given computational problem with a given algorithm
So, in a "normal" computer, the indices would be considered each to be 64 bits, or O(1).
would that mean that it uses O(1) extra storage spaces, because it uses int j and int k?
Yes.
Extra storage space means space used for something other than the input itself. And, just as with time complexity, if that extra space does not depend on the size of the input (i.e. does not grow when the input size increases), then the space complexity is O(1).
Yes, your algorithm indeed uses O(1) storage space (1), since the auxiliary space you use has a strict upper bound that is independent of the input.
(1) Assuming the integers used for iteration are in a restricted range, usually up to 2^32 - 1.

Efficiency of nested for-loops with vastly different counts

Given a is much larger than b, would
for (i = 0; i < a; i++)
    for (k = 0; k < b; k++)
be faster than
for (i = 0; i < b; i++)
    for (k = 0; k < a; k++)
It feels to me the former would be faster but I cannot seem to get my head around this.
Well, it really depends on what you're doing. It's hard to do runtime analysis without knowing what's being done. That being said, if you're using this code to traverse a large array, it's more important to go through each column in each row rather than vice versa.
[0][1][2]
[3][4][5]
[6][7][8]
is really [0][1][2][3][4][5][6][7][8] in memory.
Your computer's cache provides a greater advantage when memory accesses are close together, and going sequentially through memory rather than skipping across rows provides much more locality.
Starting a loop takes effort; there's the loop variable itself plus all the variables declared within the loop, which are allocated memory and pushed onto the stack.
This means the fewer times you enter a loop the better, so loop over the smaller range in the outer loop.
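
To make the comparison concrete, here is a minimal sketch (my own, with a trivial body): both orderings execute the body exactly a*b times, so any measured difference comes from per-loop overhead and the memory access pattern of a real body, not from the amount of work.

/* Both variants execute the body a*b times; only the loop-entry overhead
 * (and, for real workloads, the memory access pattern) can differ. */
long small_range_inner(long a, long b)   /* outer loop runs a times */
{
    long total = 0;
    for (long i = 0; i < a; i++)
        for (long k = 0; k < b; k++)
            total += i ^ k;
    return total;
}

long small_range_outer(long a, long b)   /* outer loop runs b times */
{
    long total = 0;
    for (long i = 0; i < b; i++)
        for (long k = 0; k < a; k++)
            total += i ^ k;
    return total;
}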

Why is my Strassen Matrix multiplier so fast?

As an experiment I implemented the Strassen matrix multiplication algorithm to see if it truly leads to faster code for large n.
https://github.com/wcochran/strassen_multiplier/blob/master/mm.c
To my surprise it was way faster for large n. For example, the n=1024 case took 17.20 seconds using the conventional method, whereas it took only 1.13 seconds using the Strassen method (2x2.66 GHz Xeon). What, a 15x speedup!? It should only be marginally faster. In fact, it seemed to be just as good even for small 32x32 matrices!?
The only way I can explain this much of a speed-up is that my algorithm is more cache-friendly -- i.e., it focuses on small pieces of the matrices and thus the data is more localized. Maybe I should be doing all my matrix arithmetic piecemeal when possible.
Any other theories on why this is so fast?
The recursive nature of Strassen has better memory locality, so that may be a part of the picture. A recursive regular matrix multiplication is perhaps a reasonable thing to compare to.
First question is "are the results correct?". If so, it's likely that your "conventional" method is not a good implementation.
The conventional method is not to use 3 nested for loops to scan the inputs in the order you learned in math class. One simple improvement is to transpose the matrix on the right so that it sits in memory with columns being contiguous rather than rows. Modify the multiply loop to use this alternate layout and it will run much faster on large matrices.
The standard matrix libraries implement much more cache-friendly methods that take the size of the data cache into account.
You might also implement a recursive version of the standard matrix product (subdivide into a 2x2 matrix of matrices that are half the size). This will give something closer to optimal cache performance, which Strassen gets from being recursive.
So either you're doing it wrong, or your conventional code is not optimized.
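
As a sketch of the transpose suggestion above (my own illustration, assuming square n x n matrices of doubles stored row-major in flat arrays):

#include <stdlib.h>

/* C = A * B, but B is transposed first so that the innermost loop walks
 * both operands contiguously in memory. */
void matmul_transposed(const double *A, const double *B, double *C, int n)
{
    double *Bt = malloc((size_t)n * n * sizeof *Bt);
    if (!Bt)
        return;                                /* allocation failed */

    for (int i = 0; i < n; i++)                /* Bt = transpose of B */
        for (int j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)        /* both rows are contiguous */
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    }
    free(Bt);
}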
What is the loop order in your conventional multiplication? If you have
for (int i = 0; i < new_height; ++i)
{
    for (int j = 0; j < new_width; ++j)
    {
        double sum = 0.0;
        for (int k = 0; k < common; ++k)
        {
            sum += lhs[i * common + k] * rhs[k * new_width + j];
        }
        product[i * new_width + j] = sum;
    }
}
then you're not being very nice to the cache because you're accessing the right-hand-side matrix in a non-contiguous manner. After reordering to
for (int i = 0; i < new_height; ++i)
{
    for (int k = 0; k < common; ++k)
    {
        double const fixed = lhs[i * common + k];
        for (int j = 0; j < new_width; ++j)
        {
            product[i * new_width + j] += fixed * rhs[k * new_width + j];
        }
    }
}
the accesses to two of the matrices in the innermost loop are contiguous and one operand is even fixed (note that product must be zero-initialized before these loops). A good compiler would probably do this automatically, but I chose to pull it out explicitly for demonstration.
You didn't specify the language, but as for C++, advanced compilers can even recognize the unfriendly loop order in some configurations and reorder the loops.

Resources