x86 store when data is in 2 different blocks - performance

Supose linux-32: the aligment rules say, for example, that doubles (8 Bytes) must be aligned to 4 Bytes. This means that, if we assume 64 Bytes cache blocks (a typical value for modern processors) we can have a double aligned in the 60th position, which mean that this double will be in 2 different cache blocks.
It could even happen that both parts of the double were in 2 different cache blocks located in 2 different 4KB pages.
After this brief introduction to put the question in context, I have a couple of doubts:
1- For an assembler programming where we seek maximum performance, it is recommended to prevent these things from happenning by putting alignment directives, right? Or, for any reason that I unknow, making the alignment to make the double in only 1 block doesn't imply any performance change?
2- How will be the store instruction decoded in the in the mentioned case? (supose modern intel microarchitecture). I mean, I know that a normal store x86 instruction is decoded in a micro-fused pair of str-addr and str-data, but in this case where 2 different cache blocks (and maybe even 2 different 4KB pages) are involved, this will be decoded in 2 micro-fused pair of str-addr and str-data (one for the first 4 bytes of the double and another for the last 4 bytes)? Or it will be decoded to a single micro-fused pair but having to do both the str-addr and the str-data twice the work until finally being able to exit the execution port?

Yes, of course you should align a double whenever possible, like compilers do except when forced by ABI struct-layout rules to misalign them. (The ABI was designed when i386 was current so a double always required 2 loads anyway.)
The current version of the i386 System V ABI requires 16-byte stack alignment, so local doubles (that have to get spilled at all instead of kept in regs) can be aligned, and malloc has to return memory suitable for any type, and alignof(max_align_t) = 16 on 32-bit Linux (8 on 32-bit Windows) so 32-bit malloc will always give you at least 16 (or 8)-byte aligned memory. And of course in static storage you control the alignment with align (NASM) or .p2align (GAS) directives.
For the perf downsides of cacheline splits and page splits, see How can I accurately benchmark unaligned access speed on x86_64
re: decoding: The address isn't know at decode time so obviously any effects of a line-split page-split are resolved later. For stores, probably no effect until the store-buffer entry has to commit to L1d cache. Are two store buffer entries needed for split line/page stores on recent Intel? - probably no, allocating a 2nd entry after executing the store-address uop is implausible.
For loads, re-running the load through the execution unit to get the other half (or whatever uneven split), using internal line-split buffers to combine data. (Not re-dispatching from the RS, just internally handled in the load port. But the RS does aggressively replay uops waiting for the result of a load.)
Re-running the store-data uop for a misaligned store seems unlikely, too. I don't think we see extra counts for uops_dispatched_port.port_4 perf events.

Related

Writing a full cache line at an uncached address before reading it again on x64

On x64 if you first write within a short period of time the contents of a full cache line at a previously uncached address, and then soon after read from that address again can the CPU avoid having to read the old contents of that address from memory?
As effectively it shouldn't matter what the contents of the memory was previously because the full cache line worth of data was fully overwritten? I can understand that if it was a partial cache line write of an uncached address, followed by a read then it would incur the overhead of having to synchronise with main memory etc.
Looking at documentation regards write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically what happens on normally (i.e., WB memory regions) mapped memory is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time, since an RFO request will occur to pull that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from addresss 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
extra.
The last case, where the read is bigger than the write is definitely a case where the store forwarding stalls. The quote of 11 cycles probably applies to the case that all of the involved bytes are in L1 - but the case that some bytes aren't cached at all (your scenario) it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() noticing the speed drops dramatically at i*4KB. The result is as follow: the Y-axis is the speed(MB/second) and the X-axis is the size of buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the part of 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 # 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Since the performance degradation of these two cases are caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of a cache line.
Here is my code:
void memcpy_speed(unsigned long buf_size, unsigned long iters){
struct timeval start, end;
unsigned char * pbuff_1;
unsigned char * pbuff_2;
pbuff_1 = malloc(buf_size);
pbuff_2 = malloc(buf_size);
gettimeofday(&start, NULL);
for(int i = 0; i < iters; ++i){
memcpy(pbuff_2, pbuff_1, buf_size);
}
gettimeofday(&end, NULL);
printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
free(pbuff_1);
free(pbuff_2);
}
UPDATE
Considering suggestions from #usr, #ChrisW and #Leeor, I redid the test more precisely and the graph below shows the results. The buffer size is from 26KB to 38KB, and I tested it every other 64B(26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 second. The interesting thing is the drop not only occurs exactly in 4KB boundary, but also comes out in 4*i+2 KB, with a much less falling amplitude.
PS
#Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stores into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up) so you'll see the cache sizes reflect on your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iteration (just make sure you don't time that)
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, i'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, everytime you perform an iteration you have a load and a store accessing the same set - i'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iteration to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, I'm pretty sure there's are a bunch of collisions hiding there.
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run in once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

Why is alignment imporant?

I know that some processors fail with misaligned data, and others like the oh-so-common x86, would just be slower with that.
My question is why? Why is it harder for an x86 processor to get the data from the pointer 0x12345679 than it is from the pointer 0x12345678? Just to be clear, I'm aware that page faults may happen if the data is in multiple pages, and I understand that more data may need to be fetched from memory (one part for the start of the value and one for the end), but that isn't always true and this isn't what my question is about. I'm asking, why is it always slower?
Suppose the memory starts at 0x10000000. Why is it harder for the processor to get a 2-byte short from 0x10000001 than it is from 0x10000002? Why is it harder to get a 4-byte int from 0x10000001 than it is from 0x10000000? And so forth.
Because the data bus is wider than eight bits.
Let assume that the data bus is 32 bits. To get 16 bits from address 0x10000001, it has to get the four bytes that starts at 0x10000000 and shift the value to get the two bytes in the middle.
To get 16 bits from the address 0x10000003, it has to get the words that start at 0x10000000 and 0x10000004, and use one byte from each value.
The processor can only access memory in an aligned fashion. This is a consequence of how the interconnect between the processor and memory functions.
When a processor supports unaligned reads, what's really happening is the processor issuing two separate reads (or one read of larger size) and stitching the parts together, which is why it's slower than an aligned read.
One example: if the databus is 32 bits and a 32 bit value is not on a 32 bit boundary, the bytes will have to be fetched in more than one operation and moved around to load the value properly into a processor register.

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such pitfall: suppose a data structure is mis-aligned and requires the CPU/Memory Controller to perform 2 bus transactions (instead of 1) in order to fetch the said structure, the performance is thus consequently lower.
the CPU fetches data from memory in groups of 4 bytes (it actualy depends on the hardware its 8 or other values for some types of hardware, but lets stick with 4 to keep it simple),
all is well if the data begins in an address which is dividable by 4, the CPU goes to the memory address and loads the data.
now suppose the data begins in an address not dividable by 4 say for the sake of simplicity at address 1, the CPU must take data from address 0 and then apply some algorithm to dump the byte at the 0 address , to gain access to the actual data at byte 1. this takes time and therefore lowers preformance. so it is much more efficient to have all data addresses aligned.
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32bit data fields together in a struct so that padding overhead is reduced or alternatively review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!

Understanding word alignment

I understand what it means to access memory such that it is aligned but I don’t understand why this is necessary. For instance, why can I access a single byte from an address 0x…1 but I cannot access a half word (two bytes) from the same address.
Again, I understand that if you have an address A and an object of size s that the access is aligned if A mod s = 0. But I just don’t understand why this is important at the hardware level.
Hardware is complex; this is a simplified explanation.
A typical modern computer might have a 32-bit data bus. This means that any fetch that the CPU needs to do will fetch all 32 bits of a particular memory address. Since the data bus can't fetch anything smaller than 32 bits, the lowest two address bits aren't even used on the address bus, so it's as if RAM is organised into a sequence of 32-bit words instead of 8-bit bytes.
When the CPU does a fetch for a single byte, the read cycle on the bus will fetch 32 bits and then the CPU will discard 24 of those bits, loading the remaining 8 bits into whatever register. If the CPU wants to fetch a 32 bit value that is not aligned on a 32-bit boundary, it has several general choices:
execute two separate read cycles on the bus to load the appropriate parts of the data word and reassemble them
read the 32-bit word at the address determined by throwing away the low two bits of the address
read some unexpected combination of bytes assembled into a 32-bit word, probably not the one you wanted
throw an exception
Various CPUs I have worked with have taken all four of those paths. In general, for maximum compatibility it is safest to align all n-bit reads to an n-bit boundary. However, you can certainly take shortcuts if you are sure that your software will run on some particular CPU family with known unaligned read behaviour. And even if unaligned reads are possible (such as on x86 family CPUs), they will be slower.
The computer always reads in some fixed size chunks which are aligned.
So, if you don't align your data in memory, you will have to probably read more than once.
Example
word size is 8 bytes
your structure is also 8 bytes
if you align it, you'll have to read one chunk
if you don't align it, you'll have to read two chunks
So, it's basically to speed up.
The reason for all alignment rules are the various widths of the Cache Lines (Instruction-Cache do have 16 Byte lines for the Core2 Architecture, and the Data-Cache do have 64-Byte Lines for L1 and 128-Byte Lines for L2).
So if you want to store/load data that crosses a Cahce-Line Boundary you need to load and store both Cache-lines, which hits the performance.
So you just don't do it because of the performance hit, its that simple.
Try reading a serial port. The data is 8 bits wide.
Nice hardware designers ensure it lies on a least significant byte of the word.
If you have a C structure that has elements not word aligned ( from backwards compatibility or conservation of memory say )
then the address of any byte within the structure is not word aligned.

Resources