Data Oriented Design with Mike Acton - Are 'loops per cache line' calculations right?

I've watched Mike Acton's talks about DOD a few times now to better understand it (it is not an easy subject to me). I'm referring to CppCon 2014: Mike Acton "Data-Oriented Design and C++"
and GDC 2015: How to Write Code the Compiler Can Actually Optimize.
But in both talks he presents some calculations that confuse me:
This shows that FooUpdateIn takes 12 bytes, but if you stack 32 of them you will get 6 fully packed cache lines. Same goes for FooUpdateOut, it takes 4 bytes and 32 of them gives you 2 fully packed cache lines.
In the UpdateFoos function, you can do ~5.33 loops per cache line (assuming that count is indeed 32). He then assumes that all the math takes about 40 cycles per loop, which means that each cache line would take about 213.33 cycles.
Now here's where I'm confused, isn't he forgetting about reads and writes? Even though he has 2 fully packed data structures they are in different memory spaces.
In my head this is what's happening:
1. Read in[0].m_Velocity[0] (which would take about 200 cycles based on his previous slides).
2. Since in[0].m_Velocity[1] and in[0].m_Foo are in the same cache line as in[0].m_Velocity[0], their access is free.
3. Do all the calculation.
4. Write the result to out[0].m_Foo. This is the part I'm unsure about: I assume that it would discard the previous cache line (fetched in 1.) and load the new one to write the result.
5. Read in[1].m_Velocity[0], which would again discard a cache line (the one fetched in 4.) and again take about 200 cycles.
...
So by jumping between in and out, the calculation goes from ~5.33 loops per cache line to 0.5 loops per cache line, which would be 20 cycles per cache line.
Could someone explain why he wasn't concerned about reads/writes? Or what is wrong with my thinking?
Thank you.

If we assume the L1 cache is 64 KB and one cache line is 64 bytes, then there are 1024 cache lines in total. So the write to out[0].m_Foo in step 4 will not discard the cache line used in step 2: the two are in different memory locations and both fit in the cache at the same time. This is why he uses a separate structure for the m_Foo output instead of mutating it in place as in his first implementation. He is only talking about the cost up to the point where the value is calculated; writing the value will cost the same as in his first implementation. Also, the processor can optimize loops quite well, since it can do multiple calculations in parallel (the iterations do not depend on each other, so they need not run sequentially). I hope this helps.
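The arithmetic behind these numbers can be checked with a small sketch. The cache size and line size are the assumptions stated in the answer above, the struct sizes and the 40-cycle estimate come from the question, and the variable names are made up for illustration.

```python
import math

CACHE_LINE_BYTES = 64        # assumed cache line size
L1_BYTES = 64 * 1024         # assumed L1 data cache size (64 KB)
COUNT = 32                   # number of elements, as in the question
IN_BYTES = 12                # size of one FooUpdateIn
OUT_BYTES = 4                # size of one FooUpdateOut
CYCLES_PER_LOOP = 40         # rough math cost per loop quoted in the question


def lines_needed(element_bytes, count):
    """Number of cache lines a tightly packed array occupies."""
    return math.ceil(element_bytes * count / CACHE_LINE_BYTES)


total_lines = L1_BYTES // CACHE_LINE_BYTES           # lines available in L1
in_lines = lines_needed(IN_BYTES, COUNT)             # lines used by the input array
out_lines = lines_needed(OUT_BYTES, COUNT)           # lines used by the output array
loops_per_line = CACHE_LINE_BYTES / IN_BYTES         # loops served by one input line
cycles_per_line = loops_per_line * CYCLES_PER_LOOP   # math cycles per input line

print(total_lines, in_lines, out_lines)
print(round(loops_per_line, 2), round(cycles_per_line, 2))
```

Under these assumptions the input and output arrays together occupy 8 of the available lines, so reading from one array does not have to evict the other.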

Related

Why can't two lines that differ in their address by precisely 65,536 bytes be stored in the cache at the same time?

I am reading Andrew Tanenbaum's Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries. In fact, up to 64 KB of contiguous data can be stored in the cache. However, two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value). For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or any other location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there. If this happens often enough, it can result in poor behavior. In fact, the worst-case behavior of a cache is worse than if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. First, a concept called associativity in cache design. For every possible input cache-line address (64 byte aligned on a modern x86-64 system) there are only N possible slots in the cache it may access.
The second is a problem much like the one encountered with the hash function used within a hashmap. Simply put, some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 (presumably binary) kilobytes. 64 KiB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So, in this case the address -> cache slot function is a simple AND operation, and it appears the author is talking about a direct-mapped (1-way set-associative) cache, that is, one where each line may only be stored in ONE location inside the cache. That leads to the conflict mentioned.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.
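As a minimal sketch of that AND-based mapping, assume a direct-mapped cache with 64-byte lines and 1024 slots (64 KiB in total). These parameters and the function name cache_slot are illustrative assumptions, not the exact geometry of the book's example.

```python
LINE_BYTES = 64      # assumed line size
NUM_LINES = 1024     # assumed number of slots: 1024 * 64 = 65,536 bytes


def cache_slot(address):
    """Slot index in a direct-mapped cache: drop the byte offset within
    the line, then keep only the low bits with an AND mask."""
    line_number = address // LINE_BYTES
    return line_number & (NUM_LINES - 1)


x = 0x12340
print(cache_slot(x), cache_slot(x + 65536))   # same slot: the two lines conflict
print(cache_slot(x), cache_slot(x + 64))      # neighbouring lines use different slots
```

Adding 65,536 only changes address bits above the ones kept by the mask, which is why both addresses land in the same slot.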

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you write a full cache line's worth of data to a previously uncached address and then, soon after, read from that address again, can the CPU avoid having to read the old contents of that address from memory?
In effect, it shouldn't matter what the previous contents of the memory were, because the whole cache line was overwritten. I can understand that a partial cache line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory etc.
Looking at documentation on write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time, since an RFO request will occur to pull that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
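The distinction between a "fully contained" and a "partially overlapping" load can be sketched as a check on byte ranges. This only models the geometric relationship described above, not what any particular CPU does; the function name classify_load and its return strings are made up for illustration.

```python
def classify_load(store_addr, store_size, load_addr, load_size):
    """Classify how a load's byte range relates to an earlier store's range."""
    store_end = store_addr + store_size   # one past the last stored byte
    load_end = load_addr + load_size      # one past the last loaded byte

    if load_end <= store_addr or load_addr >= store_end:
        return "no overlap"               # the store is irrelevant to this load
    if load_addr >= store_addr and load_end <= store_end:
        return "fully contained"          # every loaded byte comes from the store
    return "partial overlap"              # some bytes must come from elsewhere


# 4-byte write to address 10, 4-byte read from address 9
print(classify_load(10, 4, 9, 4))
# 4-byte write to address 10, 2-byte read from address 12
print(classify_load(10, 4, 12, 2))
```

The first call is the case that does not forward; the second is the fully contained case that older hardware often failed to forward.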
From Agner's microarchitecture guide, here's what he says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
extra.
The last case, where the read is bigger than the write, is definitely a case where the store forwarding stalls. The quote of 11 cycles probably applies to the case where all of the involved bytes are in L1, but in the case where some bytes aren't cached at all (your scenario) it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64 byte cacheline, it can hold 16 array elements from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.
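The "provide the full line, then select by offset and size" idea can be modelled in software. This is only a sketch of the selection step, not of real hardware: the 64-byte line, 4-byte little-endian ints, the base address and the helper name load_int are all assumptions made for illustration.

```python
import struct

LINE_BYTES = 64   # assumed cache line size
INT_BYTES = 4     # assumed size of one int


def load_int(line, address):
    """Pick a 4-byte little-endian int out of a full 64-byte line,
    using the low address bits as the offset into the line."""
    offset = address % LINE_BYTES
    return struct.unpack_from("<i", line, offset)[0]


# A line holding arr[0]..arr[15], with arr[i] == 100 + i.
line = struct.pack("<16i", *range(100, 116))
base = 0x1000                                   # hypothetical line-aligned address of arr[0]
value = load_int(line, base + 5 * INT_BYTES)    # models a load of arr[5]
print(value)
```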

Gnuplot memory usage trace

Before I dive deep in gnuplot I have some questions regarding its usage.
The data file contains several lines in the following pattern, separated by white space
S Allocation_time Free_Time
40 1259244359200 1259244360041
6 1259244363756 1259244367637
6 1259244368304 1259244368555
6 1259244368494 1259244369337
6 1259244359583 1259244369517
308 1259244361496 1259244369713
12 1259244361291 1259244369875
28 1259244362636 1259244370017
The first column is the allocated size, the second one is the allocation time and the last is the free time.
What I want to achieve is a histogram over time, showing the total memory usage, based on the allocation and deallocation times.
So as long as the current line's free time is smaller than the next line's allocation time, plot the allocation and deallocation accordingly.
However, sometimes the free time can be larger than the next allocation time (the allocation is not supposed to be freed yet), so I want to store those cases in a structure, and when there is eventually something to free, search through this structure and compare the allocation and free times.
Is this even possible in gnuplot, or should I look in another plotting mechanism? This trace file is input to a program I wrote, and I would like to verify that the output is similar to the original tracing.
Thank you everyone in advance.
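One possible way to sidestep the bookkeeping structure is to preprocess the trace outside gnuplot: turn every row into a +size event at the allocation time and a -size event at the free time, sort the events, and keep a running total. The sketch below assumes the trace format shown above; the function name usage_over_time is made up for illustration.

```python
TRACE = """\
S Allocation_time Free_Time
40 1259244359200 1259244360041
6 1259244363756 1259244367637
6 1259244368304 1259244368555
6 1259244368494 1259244369337
6 1259244359583 1259244369517
308 1259244361496 1259244369713
12 1259244361291 1259244369875
28 1259244362636 1259244370017
"""


def usage_over_time(trace_text):
    """Return (time, total_bytes_in_use) pairs, one per alloc/free event."""
    events = []
    for row in trace_text.splitlines():
        fields = row.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue                            # skip the header and blank rows
        size, alloc_time, free_time = map(int, fields)
        events.append((alloc_time, size))       # memory goes up at allocation
        events.append((free_time, -size))       # memory goes down at free
    events.sort()                               # chronological order

    usage = 0
    series = []
    for time, delta in events:
        usage += delta
        series.append((time, usage))
    return series


for time, usage in usage_over_time(TRACE):
    print(time, usage)
```

The printed two-column output could then be saved to a file and plotted as a step-style curve.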

Understanding caches and block sizes

A quick question to make sure I understand the concept behind a "block" and its usage with caches.
Suppose I have a small cache that holds 4 blocks of 4 words each. Let's say it's also direct-mapped. If I try to access a word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache, or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is triggered, does the CPU load one block's worth of data (4 words) starting at the accessed value in memory, or does it calculate which block that word is in and bring in that block instead?
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, ie, accessed by using some portion of the requested address (ie "lookup table key" if you will). If the cache uses a block size of 1 word, the entire address -- all N bits of it -- would be the "key". Each word would be accessible with the granularity just described.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accommodate that situation, the cache fill can take place starting with the "critical cache address", that is, the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward, usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.
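For the cache in the question (4 blocks of 4 words, direct-mapped, word-addressed), the aligned-block rule and the critical-word-first fill order can be sketched as follows. The function names are made up for illustration, and the fill order shown is the simple modulo-N wraparound described above, not the behaviour of any specific chip.

```python
WORDS_PER_BLOCK = 4   # block size from the question, in words
NUM_BLOCKS = 4        # number of block positions in the cache


def block_start(word_address):
    """First word of the aligned block containing word_address."""
    return word_address & ~(WORDS_PER_BLOCK - 1)


def cache_index(word_address):
    """Block position used by a direct-mapped cache for this address."""
    return (word_address // WORDS_PER_BLOCK) % NUM_BLOCKS


def critical_word_first_order(word_address):
    """Order in which the block's words are filled: requested word first,
    then wrapping around modulo the block size."""
    start = block_start(word_address)
    offset = word_address - start
    return [start + (offset + i) % WORDS_PER_BLOCK for i in range(WORDS_PER_BLOCK)]


print(block_start(2), cache_index(2))   # block holding words 0-3, block position 0
print(critical_word_first_order(2))     # fill order starting at the requested word
```

So an access to word 2 brings in words 0-3, not words 2-5; only the order in which those four words arrive may start at word 2.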
HTH... JoGusto.
