Identifying cache memory levels with Fortran code - performance

I have a Fortran code to test the cache memory of my computer. The code compares the time needed to compute the sum of three matrices A, B and C of size NxN. N is given as input and all elements of the matrices are 1.0.
The sum is performed by columns and by rows to compare the Gflops, bearing in mind that Fortran stores matrices in memory column by column, so consecutive elements are A(1,1), A(2,1), A(3,1), ...
The loop is repeated several times in order to compute the average time of the sum. The sum is stored in the same matrix A, so A = A + B + C.
The characteristics of my cache memory are:
hw.l3cachesize: 3145728
hw.l2cachesize: 262144
hw.l1dcachesize: 32768
hw.l1icachesize: 32768
hw.cachelinesize: 64
Since the elements of the matrices are of type real*8, I am expecting to find a decrease in Gflops when the working set goes beyond the L1 cache size.
Actually I would also like to know whether I should expect the change at 32 kB or at 64 kB, since the L1i cache is 32 kB and the L1d cache is also 32 kB.
For that I assume that the size stored in the cache is 3*N*N*8 bytes (3 matrices and 8 bytes per element).
When I check my results I am not able to identify this change in Gflops.
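For reference, here is a minimal sketch of the kind of benchmark described above (the program layout, nrep and all names are my own assumptions, not the asker's actual code). With 3*N*N*8 bytes of data, the crossover points to look for are roughly N = 37 (L1d, 32768 bytes), N = 104 (L2, 262144 bytes) and N = 362 (L3, 3145728 bytes):

    program cache_sum
      implicit none
      integer, parameter :: nrep = 100      ! repetitions to average over
      integer :: n, i, j, rep
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      real(8) :: t0, t1, gflops

      print *, 'Enter N:'
      read *, n
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 1.0d0; c = 1.0d0

      ! column-wise traversal: the inner loop walks down a column,
      ! matching Fortran's column-major memory layout
      call cpu_time(t0)
      do rep = 1, nrep
         do j = 1, n
            do i = 1, n
               a(i,j) = a(i,j) + b(i,j) + c(i,j)
            end do
         end do
      end do
      call cpu_time(t1)
      gflops = 2.0d0 * n * n * nrep / (t1 - t0) / 1.0d9   ! 2 additions per element
      print *, 'column-wise:', gflops, ' Gflops'

      ! row-wise traversal: the inner loop strides through memory
      ! by N elements at a time
      call cpu_time(t0)
      do rep = 1, nrep
         do i = 1, n
            do j = 1, n
               a(i,j) = a(i,j) + b(i,j) + c(i,j)
            end do
         end do
      end do
      call cpu_time(t1)
      gflops = 2.0d0 * n * n * nrep / (t1 - t0) / 1.0d9
      print *, 'row-wise:   ', gflops, ' Gflops'

      print *, 'check value:', a(1,1)   ! keeps the compiler from discarding the loops
    end program cache_sum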

Related

How to count # of cache misses in theory for a matrix in memory exceeding cache size?

I'm currently considering an n x n matrix M of 64-bit integer elements stored in main memory in row-major order. I have an L1 data cache of 16 KB split into 64 B blocks (no L2 or L3). My code is meant to print out each element of the matrix one at a time, by traversing the matrix in either row-first or column-first order.
In the case where n = 16 (i.e. a 16 x 16 matrix), I've counted 0 cache misses for both row-first and column-first order, since the matrix M fits entirely in the 16 KB cache (it never needs to go to main memory to fetch an element). How would I deal with the case of, say, n = 256 (a 256 x 256 matrix of 64-bit ints), i.e. when M doesn't fully fit in the cache? Do I count all the ints that don't fit as misses, or can spatial locality be leveraged somehow? Assume the cache is initially empty.
The "0 cache misses" seems to assume you start out with M already in cache. That's already a bit suspicious, but OK.
For the 256x256 case, you need to simulate how the cache behaves. You must have cache misses to bring in the missing entries. Each cache miss brings in not just the requested int, but also 7 adjacent ints.
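As a minimal sketch of such a simulation, the program below replays the two traversal orders against a model of the cache (the question does not say how the cache is organized, so a direct-mapped organization is assumed here; the program and variable names are mine):

    program cache_sim
      implicit none
      integer, parameter :: n = 256                  ! matrix dimension
      integer, parameter :: block_bytes = 64         ! cache line size
      integer, parameter :: cache_bytes = 16 * 1024  ! 16 KB L1d
      integer, parameter :: nsets = cache_bytes / block_bytes  ! 256 lines, direct-mapped
      integer(8) :: tags(0:nsets-1), blk
      integer :: i, j, set, misses

      ! row-first traversal of a row-major matrix of 8-byte ints
      tags = -1_8
      misses = 0
      do i = 0, n - 1
         do j = 0, n - 1
            blk = (int(i,8) * n + j) * 8 / block_bytes  ! memory block holding M(i,j)
            set = int(mod(blk, int(nsets, 8)))          ! direct-mapped: one line per set
            if (tags(set) /= blk) then
               misses = misses + 1
               tags(set) = blk
            end if
         end do
      end do
      print *, 'row-first misses:    ', misses

      ! column-first traversal: same addresses, loops swapped
      tags = -1_8
      misses = 0
      do j = 0, n - 1
         do i = 0, n - 1
            blk = (int(i,8) * n + j) * 8 / block_bytes
            set = int(mod(blk, int(nsets, 8)))
            if (tags(set) /= blk) then
               misses = misses + 1
               tags(set) = blk
            end if
         end do
      end do
      print *, 'column-first misses: ', misses
    end program cache_sim

With these assumptions the row-first traversal misses once per block (256 x 256 / 8 = 8192 misses), while the column-first traversal misses on every access (65536 misses): rows are 2048 bytes apart, so rows i and i+8 land on the same cache line and evict each other before any line can be reused.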

Times of the Fortran matmul function with different multiplication sizes

I have measured the time spent by Fortran's MATMUL function for different multiplication sizes (32 x 32, 64 x 64, ...) and I have questions about the results.
These are the results:
SIZE ----- TIME IN SECONDS
32 ----- 0.000071
64 ----- 0.000032
128 ----- 0.001889
256 ----- 0.010866
512 ----- 0.043
1024 ----- 0.336
2048 ----- 2.878
4096 ----- 51.932
8192 ----- 405.921856
I would expect the times to grow by a factor of 8 from one size to the next, since doubling the size doubles each of m, n and k (m * 2 * n * 2 * k * 2). I do not know if it should be like that; if so, can someone explain why the measurements do not follow that pattern?
In addition, we see an increase by a factor of about 18 when going from 2048 to 4096. Could someone tell me why?
I have measured the times with Fortran's CALL CPU_TIME() and with CALL DATE_AND_TIME(), and both give very similar results.
My processor is an AMD Phenom(tm) II X4 945 with 4 cores.
@Steve is correct: there are many factors that affect performance, especially when data sizes are small. That's why all of your results at and below 2048 are pretty much semi-random and essentially irrelevant. All or most of the data likely sits in the several layers of CPU cache, so thread scheduling and other hardware-related events skew these results heavily. If you run these tests again you will find different results at these small sizes.
So, when you go from 2048 to 4096 you get a major jump: the data no longer fits into the CPU caches, and the computer needs to load blocks of data from RAM into the caches. This explains the large jump in time.
It is at these sizes and larger that the computer has to do the full cycle of typical operations (load data from RAM, perform operations, save results to RAM), and this is the performance you will get as data gets even larger. This is also where performance becomes very consistent as data grows. Notice that going from 4096 to 8192 takes very close to exactly 8 times longer (405.9 / 51.9 ≈ 7.8). At this point, going to 16384 should take almost exactly 8 times 406 seconds.
Any size smaller than 4096 is not giving your computer enough work to measure the performance accurately.
There should be a factor of 8 between successive timings, and deviations are generally due to memory effects such as cache alignment and the ratio of cache size to array size. For small arrays there may be calling overhead in matmul(). A triple do loop can be faster, at least with some optimization (try -O3 -march=native), and should work equally well for small sizes.
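As a rough sketch of such a comparison, the harness below times both the MATMUL intrinsic and a naive triple do loop (the size list, the j-k-i loop order and all names are my own choices, not code from the question):

    program matmul_timing
      implicit none
      integer, parameter :: nsizes = 6
      integer :: sizes(nsizes) = [32, 64, 128, 256, 512, 1024]
      integer :: s, n, i, j, k
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      real(8) :: t0, t1

      do s = 1, nsizes
         n = sizes(s)
         allocate(a(n,n), b(n,n), c(n,n))
         call random_number(a)
         call random_number(b)

         ! time the MATMUL intrinsic
         call cpu_time(t0)
         c = matmul(a, b)
         call cpu_time(t1)
         print '(a,i5,a,f10.6,a)', 'matmul  n=', n, ': ', t1 - t0, ' s'

         ! time a naive triple do loop; the j-k-i order keeps the
         ! innermost loop walking down columns (column-major storage)
         call cpu_time(t0)
         c = 0.0d0
         do j = 1, n
            do k = 1, n
               do i = 1, n
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
               end do
            end do
         end do
         call cpu_time(t1)
         print '(a,i5,a,f10.6,a)', 'do loop n=', n, ': ', t1 - t0, ' s'

         deallocate(a, b, c)
      end do
    end program matmul_timing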

What is overhead percentage?

Consider a 2 KB direct-mapped cache with blocks of size 1 word. As always, addresses are 32 bits.
How many blocks does the cache contain? 2^7
How many bits long is each tag? (Tags are shown in pink in the class notes.) 23 bits
How many bits long is each cache index? (These are green in the notes.) 7 bits
What is the total size of the cache? (32 + 1 + 23) x 2^7 bits
What percentage of the total size is the overhead?
What is "overhead", and what is the percentage of overhead?
Overhead is the tag size, plus any other bits the cache needs to store other than the data itself.
(For example, an associative cache with LRU replacement would also need to store some bits recording the LRU state, to track which member of the set is next in line for eviction.)
Overhead percentage is simply overhead / total size, as the assignment says (not overhead / data).
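Using the per-block figures from the question above (32 data bits, 23 tag bits, 1 valid bit): every block has the same layout, so the 2^7 factor cancels and the percentage works out to

overhead / total size = (23 + 1) / (32 + 23 + 1) = 24 / 56 ≈ 42.9%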

Calculating number of bits in a cache

Preface: There are many different design parameters that are important to a cache's overall performance. Below are listed parameters for different direct-mapped cache designs.
Cache data size: 32 KiB
Cache block size: 2 words
Cache access time: 1 cycle
Question: Calculate the number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.
Here's the formula:
Number of bits in a cache = 2^n x (block size + tag size + valid field size)
Here's what I got: 65536(1 + 14 x (32 x 2))...
Is this correct?
Using (2^index bits) x (valid bits + tag bits + (data bits x 2^offset bits)):
For the first one I get:
total bits = 2^15 x (1 + 14 + (32 x 2^1)) = 2588672 bits
For the cache with 16-word blocks I get:
total bits = 2^13 x (1 + 13 + (32 x 2^4)) = 4308992 bits
The next smallest cache with 16-word blocks and a 32-bit address works out to be 2158592 bits, smaller than the first cache.
I'm stuck on the same problem too, but I have the answer to the first part.
To calculate the total number of bits required:
Convert the KiB to words to get the index bits.
Use the answer from part 1 to get your tag bits.
Plug them into this formula:
(2^(index bits)) x ((tag bits) + (valid bits) + (data size))
Hint: the data size is 64 bits in this case and there is 1 valid bit, so you just need to find the index and tag bits.
And I don't think your answer is right. I didn't check, but I can see you are multiplying 1 + 14 and (32 x 2) instead of adding them.
I think the formula you were using is correct. According to my textbook ("Computer Organization and Design: The Hardware/Software Interface", 5th edition), the total number of bits in a direct-mapped cache is:
2^(index bits) x (block size + tag size + valid field size)
The block size was given by the question: 2 words of 32 bits each, i.e. 64 bits.
Tag size = 32 - offset bits - index bits.
The valid field is usually 1 valid bit.
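Here is a small sketch that applies this formula, assuming byte addressing and 4-byte words (with those assumptions the block count comes out as 2^12 rather than the 2^15 used in the answer above, so the totals differ):

    program cache_bits
      implicit none
      integer :: data_bytes, block_words, block_bytes
      integer :: nblocks, offset_bits, index_bits, tag_bits
      integer(8) :: total_bits

      data_bytes  = 32 * 1024      ! 32 KiB of cache data
      block_words = 2              ! 2 words per block, 4 bytes per word (assumed)
      block_bytes = block_words * 4

      nblocks     = data_bytes / block_bytes                  ! 4096 = 2^12 blocks
      offset_bits = nint(log(real(block_bytes)) / log(2.0))   ! byte offset within a block
      index_bits  = nint(log(real(nblocks)) / log(2.0))
      tag_bits    = 32 - index_bits - offset_bits             ! rest of the 32-bit address

      ! per block: data bits + tag bits + 1 valid bit
      total_bits = int(nblocks, 8) * (int(block_words, 8) * 32 + tag_bits + 1)

      print *, 'index bits:', index_bits, ' tag bits:', tag_bits
      print *, 'total bits:', total_bits   ! 2^12 x (64 + 17 + 1) = 335872
    end program cache_bits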

How to compute the time of an in-place external merge sort?

The original problem is as follows:
You are to sort 1 PB of integers ranging from -2^31 to 2^31 - 1 (int). You have 1024 machines, each having 1 TB of disk space and 16 GB of memory. Assume the disk speed is 128 MB/s (r/w) and the memory speed is 8 GB/s (r/w). CPU time can be ignored, and network transfer time can be ignored for simplicity. Compute the approximate time needed.
I know that with an external sort we can sort the 1 TB of data on a single machine in roughly 10 hours, computed like this:
Disk access (2 reads + 2 writes): 4 x 1 TB / 128 MB/s = 4 x 2^40 / 2^27 = 2^15 sec ≈ 9 hrs
Memory access: sorting the 2^38 integers (1 TB at 4 bytes each) in 64 parts (2^32 each, so each part fits in the 16 GB of memory) takes roughly 1.3 min per part, so about 1.4 hr in total.
The 63-way merge takes several seconds and is thus ignored.
But what about the next step, combining the 1024 TB of data across machines? I have no idea how this is computed, so any help would be appreciated.
2^31 is about 2 billion (2 "giga"), so you are looking at a lot of duplicate numbers in a fixed range. So consider Radix Sort (http://en.wikipedia.org/wiki/Radix_sort).
Each processor, for a subset of the data, creates a 'count' array (x[0] contains the count of 0s, etc.). Then you can merge all the results into one array. Later you can "construct" the sorted array.
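As a scaled-down sketch of that counting idea (a toy value range is used here, since a full count array over all 2^32 possible ints at 8 bytes per counter would take 32 GB, more than one machine's 16 GB of memory, so the real scheme would also have to partition the value range):

    program count_sort_sketch
      implicit none
      integer, parameter :: nvals = 16     ! stand-in for the 2^32 possible int values
      integer, parameter :: ndata = 100    ! stand-in for one machine's shard
      integer :: data(ndata), counts(0:nvals-1)
      integer :: i, v, k
      real :: r

      ! fill the shard with pseudo-random values in 0 .. nvals-1
      do i = 1, ndata
         call random_number(r)
         data(i) = int(r * nvals)
      end do

      ! build the count array; per-machine count arrays would then be
      ! summed element-wise across the 1024 machines
      counts = 0
      do i = 1, ndata
         counts(data(i)) = counts(data(i)) + 1
      end do

      ! "construct" the sorted output directly from the counts
      k = 0
      do v = 0, nvals - 1
         do i = 1, counts(v)
            k = k + 1
            data(k) = v
         end do
      end do
      print '(20i3)', data
    end program count_sort_sketch

Summing the per-machine count arrays element-wise gives the global counts, from which the fully sorted output can be written back out without any merge passes.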
