How to write the loop with better cache behavior? - caching

I am working on a loop like this:
int arrA[BIG], arrB[BIG], arrC[BIG];
for (int i = 0; i < BIG; i++) {
    do_operation(arrA[i], arrB[i], arrC[i]);
}
Here do_operation is not an actual function. It just means some operations between A,B,C.
From the profiling data, it looks like the cache miss rate is high.
How can I rewrite the loop with better cache behavior?
Thanks for any comment!

You are accessing each array linearly, which is essentially optimal for cache usage (and for the hardware prefetcher).
However, if your arrays are an unfortunate size (usually large powers of two), you will get thrashing: arrA[i], arrB[i] and arrC[i] will all map to the same cache set and constantly evict each other. Essentially, every single access will be a cache miss. To avoid this, you should try padding each array slightly, so that the three arrays no longer start at the same offset within the cache.
See e.g. Understanding cache thrashing.
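The padding idea can be sketched as follows. This is only an illustration under assumed numbers: BIG as a power of two, 64-byte cache lines, and a 32 KiB 8-way L1 (64 sets). The struct names, the pad fields and the helper set_of are hypothetical; placing the arrays in one struct is just a way to make their relative offsets predictable.

```c
#include <stddef.h>

#define BIG        (1u << 20)   /* element count; a large power of two (assumed) */
#define LINE_BYTES 64u          /* assumed cache-line size */
#define NUM_SETS   64u          /* assumed: 32 KiB, 8-way L1 -> 64 sets */

/* Unpadded: consecutive arrays start exactly BIG * sizeof(int) bytes apart. */
struct unpadded {
    int a[BIG];
    int b[BIG];
    int c[BIG];
};

/* Padded: one spare cache line after a and after b shifts the next array. */
struct padded {
    int a[BIG];
    char pad_a[LINE_BYTES];
    int b[BIG];
    char pad_b[LINE_BYTES];
    int c[BIG];
};

/* Cache set selected by a byte offset, under the assumed geometry. */
static unsigned set_of(size_t byte_offset)
{
    return (unsigned)((byte_offset / LINE_BYTES) % NUM_SETS);
}
```

With these assumptions, set_of(offsetof(struct unpadded, b)) equals the set of a and c, while in struct padded the three arrays start in three different sets, so a[i], b[i] and c[i] no longer compete for the same set.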

Related

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use Memory Heap. However, heap access is costly.
For faster memory allocation and freeing, we adopted a global Free List. As our program is multithreaded, the Free List is protected by a Critical Section. However, the Critical Section causes a bottleneck in parallelism.
To remove the Critical Section, we assigned a Free List to each thread, i.e. Thread Local Storage. However, thread T1 always allocates memory blocks and thread T2 always frees them, so the Free List in thread T2 keeps growing, and meanwhile the Free List provides no benefit.
Despite the bottleneck of the Critical Section, we adopted the Critical Section again, with a different method. We prepare several Free Lists, each with its own Critical Section, thus Free Lists 0~N-1 and Critical Sections 0~N-1. We prepare an atomically updated integer value which cycles through 0, 1, 2, ... N-1 and then 0, 1, 2, ... again. For each allocation and free, we get the integer value X, then update it, enter the X-th Critical Section, then access the X-th Free List. However, this is considerably slower than the previous method (using Thread Local Storage). The atomic operation gets quite slow as there are more threads.
As updating the integer value non-atomically causes no corruption, we did the update in a non-atomic way. However, as the integer value is sometimes stale, there is a good chance that different threads access the same Critical Section and Free List. This causes the bottleneck again, though much less than with the previous method.
Instead of the integer value, we used the thread ID hashed to the range (0~N-1), and the performance got better.
I guess there must be a much better way of doing this, but I cannot find one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, especially when you know something about your memory usage that is unknown to the OS.
I'm writing my untested idea here; I hope you get some benefit from it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try not to use TLS, nor critical blocking, nor atomic ops.
If (repeat: if, if, if) the app can fit to several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and useless holes), then start by asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is two-dimensional; the second field just stores a flag for used/free block. If your arrays are very large, then it's better to use a dedicated array for the flags (because contiguous memory access is generally faster).
Now, someone asks for a block of memory of size sB. A specialized function (or object or whatever) searches for the array whose blocks are of size greater than or equal to sB, and then selects a block by looking at the used/free flag. Just before ending this task, the proper block flag is set to "used".
When two or more threads ask for blocks of the same size, the flag may be corrupted. Using TLS will solve this issue, and critical blocking too. I think you can set a bool flag at the beginning of the search in the flags array that makes the other threads wait until the flag changes, which only happens after the block flag changes. With pseudo code:
MemoryGetter(sB)
{
    //select which array depending on 'sB'
    for (i = 0; i < numOfarrays; i++)
        if (sizeOfArr(i) >= sB) {
            arrMatch = i
            break //exit for
        }
    //wait if another thread wants a block from the same arrMatch array
    while ( searching(arrMatch) == true )
        ; //wait
    //block other threads wanting a block from the same arrMatch array
    searching(arrMatch) = true
    //Get the first free block
    selectedBlock = NULL
    for (i = 0; i < numOfBlocks; i++)
        if ( arrOfUsed(arrMatch, i) != true ) {
            selectedBlock = addressOf(....)
            //mark the block as used
            arrOfUsed(arrMatch, i) = true
            break; //exit for
        }
    //Allow other threads
    searching(arrMatch) = false
    return selectedBlock //NOTE: selectedBlock==NULL means no free block
}
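One caveat with the pseudocode: the "wait while searching, then set searching" pair is two separate steps, so two threads can both see the flag as false and enter the search together. The sketch below is not the answer's method. It swaps in a C11 atomic_flag test-and-set (so it does use an atomic operation) and covers a single block size only. The names struct pool, pool_init, pool_get and pool_put, and the sizes, are hypothetical. Freeing also takes the flag here, as a conservative choice to avoid a data race on the used array.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BLOCKS 4    /* hypothetical number of blocks in the pool */
#define BLOCK_SIZE 64   /* hypothetical block size in bytes */

struct pool {
    atomic_flag searching;                         /* spin flag for this pool */
    bool used[NUM_BLOCKS];                         /* used/free flag per block */
    unsigned char blocks[NUM_BLOCKS][BLOCK_SIZE];  /* the preallocated blocks */
};

static void pool_init(struct pool *p)
{
    atomic_flag_clear(&p->searching);
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        p->used[i] = false;
}

/* Returns the first free block, or NULL if every block is in use. */
static void *pool_get(struct pool *p)
{
    void *selected = NULL;
    /* Test-and-set is one indivisible step, unlike "check, then set". */
    while (atomic_flag_test_and_set_explicit(&p->searching, memory_order_acquire))
        ; /* spin until the other thread clears the flag */
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            selected = p->blocks[i];
            break;
        }
    }
    atomic_flag_clear_explicit(&p->searching, memory_order_release);
    return selected;
}

/* Marks a block obtained from pool_get as free again. */
static void pool_put(struct pool *p, void *block)
{
    size_t i = (size_t)((unsigned char *)block - &p->blocks[0][0]) / BLOCK_SIZE;
    while (atomic_flag_test_and_set_explicit(&p->searching, memory_order_acquire))
        ; /* spin */
    p->used[i] = false;
    atomic_flag_clear_explicit(&p->searching, memory_order_release);
}
```

A caller would keep one such pool per block size and pick the pool the same way the pseudocode picks arrMatch. Whether this beats a per-size Critical Section would have to be measured.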
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that the memory used is greater than with normal OS requests, but not by much if you choose "good" sizes, the ones most used.
Some improvements can be made:
Cache the last freed block (per size) so as to avoid the search.
Start with not that many blocks, and ask the OS for more memory only when needed. Play with the 'number of blocks' for each size depending on your app. Find the optimal case.

Is there a way to "unfetch" a cache line?

Let's say I'm looping through 10 different 4kb arrays of ints, incrementing them:
int* buffers[10] = ...; // 10 4kb buffers, not next to each other
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 512; j++) {
        buffers[i][j]++;
    }
}
The compiler/CPU are pretty cool, and can do some cache prefetching for the inner loop. That's awesome. But...
...I've just eaten up to 40kb of cache, and kicked out data which the rest of my program enjoyed having in the cache.
It would be cool if I could hint to the compiler or CPU that "I'm not touching this memory again in the foreseeable future, so you can reuse these cache lines":
int* buffers[10] = ...;
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 512; j++) {
        buffers[i][j]++;
    }
    // Unfetch entire 4kb buffer
    cpu_cache_unfetch(buffers[i], 4096);
}
cpu_cache_unfetch would conceptually "doom" any cache lines in that range, throwing them away first.
In the end, this will mean that my little snippet of code uses 4kb of cache, instead of 40kb. It would reuse the 4kb of cache 10 times. The rest of the program would appreciate that very much.
Would this even make sense? If so, is there a way to do this?
Also appreciated: let me know all the ways I've shown myself to fundamentally misunderstand caching! =D
I only know the answer for x86. This is definitely architecture-specific; different ISAs have different cache-control features.
On x86, yes, clflush / clflushopt, but they only evict one single cache line per execution. (They force write-back + eviction, like you'd need for memory-mapped non-volatile storage). My understanding is that clflushopt is not usually worth it for a case like this, vs. just allowing cache pollution to happen.
In theory there are possible speedups from using NT prefetch for read-only, but that's brittle (tuning the software-prefetch depends on the HW, and getting it wrong can hurt a lot). Doing a regular store would probably undo the effects of an NT prefetch and leave the line in the most-recently-used position in L1, L2, and L3.
One possibly-crazy approach would be NT stores. Load a whole cache-line of data (four 16-byte vectors = 64 bytes), then store the updated values with movntdq.
NT means "non-temporal"; for use when data will not be referenced again in the near future (even by another core). What is the meaning of "non temporal" memory accesses in x86 has some pretty generic answers, but may help.
According to Intel's manual, NT stores evict the destination cache line if it was previously cached (What happens with a non-temporal store if the data is already in cache?), so it would work for your use-case. But the compiler would have to be sure to reach a 64-byte alignment boundary in the inner loop so it can read one or two whole cache lines, instead of reading 32 bytes of one line and 32 bytes of another, and evicting a line with an NT store before reading its last 32 bytes. (Pointer math is easy in asm, though; compilers do know how to go scalar until an alignment boundary.)
The normal use-case for NT stores is for write-only destination buffers to avoid the MESI RFO overhead, but this use-case is at least possibly a win.
See discussion in comments chat: this might perform significantly worse. Definitely benchmark both ways before doing this, preferably on a variety of hardware including multi-socket systems.
It's also almost definitely worse if the array was hot in cache to start with. I was assuming that this was the only thing to touch it, rather than the last in a chain of modifications.
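A rough sketch of the NT-store approach with SSE2 intrinsics, assuming the buffer is 64-byte aligned and the element count is a multiple of 16 ints (one cache line). The function name increment_nt is hypothetical, and the scalar fallback is only there so the code still builds where SSE2 is unavailable. As noted above, this may well be slower than the plain loop, so treat it as something to benchmark rather than a recommendation.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Increment n ints. Assumes buf is 64-byte aligned and n is a multiple of 16. */
static void increment_nt(int *buf, size_t n)
{
#if defined(__SSE2__)
    const __m128i one = _mm_set1_epi32(1);
    for (size_t i = 0; i < n; i += 16) {
        __m128i *line = (__m128i *)(buf + i);
        /* Read the whole 64-byte line first, before any NT store can evict it. */
        __m128i v0 = _mm_add_epi32(_mm_load_si128(line + 0), one);
        __m128i v1 = _mm_add_epi32(_mm_load_si128(line + 1), one);
        __m128i v2 = _mm_add_epi32(_mm_load_si128(line + 2), one);
        __m128i v3 = _mm_add_epi32(_mm_load_si128(line + 3), one);
        /* movntdq: non-temporal stores of the updated line. */
        _mm_stream_si128(line + 0, v0);
        _mm_stream_si128(line + 1, v1);
        _mm_stream_si128(line + 2, v2);
        _mm_stream_si128(line + 3, v3);
    }
    _mm_sfence(); /* order the NT stores before later stores */
#else
    for (size_t i = 0; i < n; i++)  /* plain fallback without NT stores */
        buf[i]++;
#endif
}
```

For the question's buffers this would be called once per buffer, for example increment_nt(buffers[i], 512), provided each buffer really is 64-byte aligned.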

Does the hardware-prefetcher benefit in this memory access pattern?

I have two arrays: A with N_A random integers and B with N_B random integers between 0 and (N_A - 1). I use the numbers in B as indices into A in the following loop:
for(i = 0; i < N_B; i++) {
    sum += A[B[i]];
}
Experimenting on an Intel i7-3770, N_A = 256 million, N_B = 64 million, this loop takes only .62 seconds, which corresponds to a memory access latency of about 9 nanoseconds.
As this latency is too small, I was wondering if the hardware prefetcher is playing a role. Can someone offer an explanation?
The HW prefetcher can see through your first level of indirection (B[i]) since these elements are sequential. It's capable of issuing multiple prefetches ahead, so you could assume that the average access into B would hit the caches (either L1 or L2). However, there's no way that the prefetcher can predict random addresses (the data stored in B) and prefetch the correct elements from A. You still have to perform a memory access in almost all accesses to A (disregarding occasional lucky cache hits due to reuse of lines).
The reason you see such low latency is that the accesses into A are not serialized: the CPU can access multiple elements of A simultaneously, so the time doesn't just accumulate. In fact, you are measuring memory bandwidth here (how long it takes to access 64M elements overall), not memory latency (how long it takes to access a single element).
A reasonable "snapshot" of the CPU memory unit should show several outstanding requests - a few accesses into B[i], B[i+64], ... (the intermediate accesses should simply get merged, as each request fetches a 64-byte line), all of which would probably be prefetches reflecting future values of i, intermixed with random accesses to A elements according to the previously fetched elements of B.
To measure latency, you need each access to depend on the result of the previous one, e.g. by making the content of each element in A the index of the next access.
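A minimal pointer-chasing sketch of that idea, assuming a table where each element holds the index of the next element to visit. The names build_cycle and chase are hypothetical. The table is shuffled with Sattolo's algorithm so the chain forms one cycle through every slot, and a small LCG is used instead of rand() only to keep the example deterministic.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill next[0..n-1] so that following next[] visits all n slots in one cycle
   (Sattolo's algorithm, driven by a simple 64-bit LCG). */
static void build_cycle(size_t *next, size_t n, uint64_t seed)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i-- > 1; ) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((seed >> 33) % i);   /* j in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Each load needs the result of the previous one, so loads cannot overlap. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--)
        idx = next[idx];
    return idx;
}
```

Timing chase over a table much larger than the last-level cache and dividing by the step count gives an approximate per-access latency. Return or print the final index so the compiler cannot drop the loop.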
The CPU charges ahead in the instruction stream and will juggle multiple outstanding loads at once. The stream looks like this:
load b[0]
load a[b[0]]
add
loop code
load b[1]
load a[b[1]]
add
loop code
load b[2]
load a[b[2]]
add
loop code
...
The iterations are only serialized by the loop code, which runs quickly. All loads can run concurrently. Concurrency is just limited by how many loads the CPU can handle.
I suspect you wanted to benchmark random, unpredictable, serialized memory loads. This is actually pretty hard on a modern CPU. Try to introduce an unbreakable dependency chain:
int lastLoad = 0;
for(i = 0; i < N_B; i++) {
    int load = A[B[i] + (lastLoad & 1)]; //be sure to make A one element bigger
    sum += load;
    lastLoad = load;
}
This requires the previous load to complete before the address of the next load can be computed.

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The results are as follows: the Y-axis is the speed (MB/second) and the X-axis is the buffer size for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the 1KB-150KB and 1KB-32KB ranges.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
In these two cases, the performance degradation is caused by unfriendly loops that read scattered bytes into the cache, wasting the rest of the space in each cache line.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;
    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);
    gettimeofday(&start, NULL);
    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
        start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely, and the graph below shows the results. The buffer size ranges from 26KB to 38KB, and I tested it every 64B (26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at 4KB boundaries, but also shows up at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to fill the drop: adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue: when first allocating memory, the OS would just add the pages to the page map, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large, and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set, and I'm guessing this may cause a conflict in the HW. Worse, you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set. I'm pretty sure there are a bunch of collisions hiding there.
I also see that the Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
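To make the quoted condition concrete, here is a small sketch that checks whether two addresses share the same offset within a 4 KiB window. It assumes, as a simplification, that aliasing comes down to the low 12 address bits matching; real hardware also takes access size and partial overlap into account. The name same_4k_offset is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIAS_MASK 0xFFFu  /* low 12 address bits: the offset within 4 KiB */

/* True when two different addresses have identical low 12 bits. */
static bool same_4k_offset(uintptr_t load_addr, uintptr_t store_addr)
{
    return load_addr != store_addr &&
           ((load_addr ^ store_addr) & ALIAS_MASK) == 0;
}
```

With the example ranges above, 0x1000 and 0x2000 match. If a 2 KB dummy buffer moves the second buffer to 0x2800, the low 12 bits differ and the check fails, which is consistent with the dummy buffer removing the drop.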
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
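The "write one byte into each 4KB page" idea could look like the sketch below. The 4096-byte page size is an assumption (on POSIX systems it can be queried with sysconf), and the name prewarm is hypothetical. Note that it overwrites the first byte of every page, which is harmless for freshly malloc'ed benchmark buffers but not for buffers holding real data.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumed page size */

/* Touch one byte in every page of buf so each page is mapped before timing.
   Returns the number of pages touched. */
static size_t prewarm(unsigned char *buf, size_t size)
{
    volatile unsigned char *p = buf;  /* volatile keeps the stores in place */
    size_t touched = 0;
    for (size_t off = 0; off < size; off += PAGE_BYTES) {
        p[off] = 0;
        touched++;
    }
    return touched;
}
```

In the question's code this would be called for both pbuff_1 and pbuff_2 right after the two malloc calls and before the first gettimeofday.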
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
    runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion, what you are seeing is the effect of the hardware prefetcher not being willing to cross a page boundary, so as not to cause (potentially unnecessary) page faults.

are array initialization operations cached as well

If you are not reading a value but assigning a value
for example
int array[] = new int[5];
for(int i = 0; i < array.length; i++){
    array[i] = 2;
}
Does the array still come into the cache? Can't the CPU bring the array elements one by one into its registers, do the assignment, and then write the updated value to main memory, bypassing the cache because it's not necessary in this case?
The answer depends on the cache protocol. I answered assuming Write Back, Write Allocate.
The array will still come into the cache, and it will make a difference. When a cache block is retrieved from memory, it's more than just a single memory location (the actual size depends on the design of the cache). So since arrays are stored in order in memory, pulling in array[0] will pull in the rest of the block, which will include (at least some of) array[1], array[2], array[3] and array[4]. This means the following accesses will not have to go to main memory.
Also, after all this is done, the values will NOT be written to memory immediately (under write back). Instead, the CPU will keep using the cache as the memory for reads/writes until that cache block is replaced from the cache, at which point the values will be written to main memory.
Overall this is preferable to going to memory every time, because the cache is much faster and the chances are the user is going to use the memory they just set relatively soon.
If the protocol is Write Through, No Allocate, then it won't bring the block into the cache, and it will write straight through to main memory.
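As a back-of-the-envelope illustration (written in C rather than Java, and ignoring the JVM's array header), the helper below counts how many cache lines a byte range spans, assuming 64-byte lines. The name lines_spanned is hypothetical. Under Write Back, Write Allocate, that count is roughly the number of line fills the loop triggers. Under Write Through, No Allocate, each of the five stores would instead go to memory, leaving aside any write-combining the hardware may do.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Number of cache lines covered by `bytes` bytes starting at address `addr`. */
static size_t lines_spanned(uintptr_t addr, size_t bytes)
{
    if (bytes == 0)
        return 0;
    return (size_t)((addr + bytes - 1) / LINE_BYTES - addr / LINE_BYTES + 1);
}
```

Five 4-byte ints occupy 20 bytes, so they fit in one line when the array starts at the beginning of a line, and straddle two lines when it starts 56 bytes into one.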
