Cache aligned linked list - caching

I am aware about the concepts of data type or structure alignment, packing, padding issues etc. I have implemented a single linked list where each node occupies approximately 250 bytes i.e. around 4 times the size of cache line of 64 bytes. My machine is a Intel 64 bit architecture.
Now a single linked list is essentially a pointer chasing data structure and thus suffers from high number of cache misses. In order to reduce the cache misses, I aligned each data structure node using *posix_memalign* function to cache lined boundary of 64 bytes. Now all linked list nodes are cache aligned.
After doing this, I find that memory consumption of linked list has increased considerably and also the performance has actually deteriorated. Can anyone explain what could be possibly going wrong ?

I don't know what malloc you use but this is from tcmalloc
// For use by exported routines below that want specific alignments
//
// Note: this code can be slow for alignments > 16, and can
// significantly fragment memory. The expectation is that
// memalign/posix_memalign/valloc/pvalloc will not be invoked very
// often. This requirement simplifies our implementation and allows
// us to tune for expected allocation patterns.

Related

CPU cache: does the distance between two address needs to be smaller than 8 bytes to have cache advantage?

It may seem a weird question..
Say the a cache line's size is 64 bytes. Further, assume that L1, L2, L3 has the same cache line size (this post said it's the case for Intel Core i7).
There are two objects A, B on memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on the cache boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by CPU, B will be read into the cache, too. So if B is needed, and the cache line is not evicted yet, CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by CPU, B is not read into the cache line along with A. So we say "CPU doesn't like chase pointers around", and it is one of the reason to avoid heap allocated node-based data structure, like std::list.
My question is, if N > 64 but is still small, say N = 70, in other words, A and B do not fit in one cache line but are not too far away apart, when A is loaded by CPU, does fetching B takes the same amount of clock cycles as it would take when N is much larger than 64?
Rephrase - when A is loaded, let t represent the time elapse of fetching B, is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since CPU cache is hierarchical.
It is even better if there is a quantitative research.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
In some cases one can also exploit stride-base prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult and limited context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with
the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
See Paul's excellent answer for other factors.

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64 byte cacheline, it can hold 16 array elements from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.

Page boundaries, implementing memory pool

I have decided to reinvent the wheel for a millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (could be smaller, or larger), and divide this area into 128 byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead between 8 and 16 bytes I guess. Might be some more, I've read posts talking about 60 bytes.
Question is, do I need to pay attention to not to have one of my 128 byte blocks across page boundaries?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many blocks of, say, 4000 bytes (to take into account HeapAlloc() overhead), and sub-divide this 4000 bytes into 128 byte blocks (4000 / 128 = 31 blocks, 128 bytes each) and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?
Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows pooling function ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is
allocated. Memory allocations of PAGE_SIZE or less do not cross page
boundaries. Memory allocations of less than PAGE_SIZE are not
necessarily page-aligned but are aligned to 8-byte boundaries in
32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the idea that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place. That is, have enough data in the same contiguous region of memory that you can take full advantage of cache locality.
I've found out that if I used _align_malloc(), I wouldn't need to worry wether spreading my sub-block to two pages would make any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() noticing the speed drops dramatically at i*4KB. The result is as follow: the Y-axis is the speed(MB/second) and the X-axis is the size of buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the part of 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 # 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Since the performance degradation of these two cases are caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of a cache line.
Here is my code:
void memcpy_speed(unsigned long buf_size, unsigned long iters){
struct timeval start, end;
unsigned char * pbuff_1;
unsigned char * pbuff_2;
pbuff_1 = malloc(buf_size);
pbuff_2 = malloc(buf_size);
gettimeofday(&start, NULL);
for(int i = 0; i < iters; ++i){
memcpy(pbuff_2, pbuff_1, buf_size);
}
gettimeofday(&end, NULL);
printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
free(pbuff_1);
free(pbuff_2);
}
UPDATE
Considering suggestions from #usr, #ChrisW and #Leeor, I redid the test more precisely and the graph below shows the results. The buffer size is from 26KB to 38KB, and I tested it every other 64B(26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 second. The interesting thing is the drop not only occurs exactly in 4KB boundary, but also comes out in 4*i+2 KB, with a much less falling amplitude.
PS
#Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stores into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up) so you'll see the cache sizes reflect on your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iteration (just make sure you don't time that)
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, i'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, everytime you perform an iteration you have a load and a store accessing the same set - i'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iteration to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, I'm pretty sure there's are a bunch of collisions hiding there.
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run in once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such pitfall: suppose a data structure is mis-aligned and requires the CPU/Memory Controller to perform 2 bus transactions (instead of 1) in order to fetch the said structure, the performance is thus consequently lower.
the CPU fetches data from memory in groups of 4 bytes (it actualy depends on the hardware its 8 or other values for some types of hardware, but lets stick with 4 to keep it simple),
all is well if the data begins in an address which is dividable by 4, the CPU goes to the memory address and loads the data.
now suppose the data begins in an address not dividable by 4 say for the sake of simplicity at address 1, the CPU must take data from address 0 and then apply some algorithm to dump the byte at the 0 address , to gain access to the actual data at byte 1. this takes time and therefore lowers preformance. so it is much more efficient to have all data addresses aligned.
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32bit data fields together in a struct so that padding overhead is reduced or alternatively review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!

Resources