MESI protocol snoop implementation issue - caching

I have a MESI protocol question. Assume that I have two cores (core 1 and core 2) and each core has its own L2 cache. Both cores hold the same cache line in state S, meaning they both have the same clean data. At t=0, core 1 writes the cache line, so core 1 will switch to M (modified) and core 2 will eventually end up in state I (invalid). In the physical world it takes time for this transaction to finish; let's say it takes 5 seconds for cache 2 to learn that cache 1 updated the cache line.
Now assume that at t=2, core 2 writes the same cache line and switches to M. Core 1 will be notified of this write at t=7 (2+5). So core 2 invalidates its line at t=5 and core 1 invalidates its line at t=7. Now both lines are invalid, and the data written by core 1 and then by core 2 is lost. This obviously does not follow the protocol. What is wrong with my logic, and how is this nonsense prevented?

The two cores have to agree with each other before updating. You can do this via a snoopy or a directory-based protocol. So in your example, the caches cannot simply change their state; they have to request the change. Whoever then wins the arbitration gets to change to Modified, while the other is invalidated.
These slides seem to sum it up pretty well. You want to look at slide 20 onward for the snoopy protocol as an example.
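For intuition, here is a minimal, hypothetical sketch (not a real implementation; all names are invented) of why arbitration fixes the scenario above: a core may not flip S to M on its own, it must first win ownership of the shared bus and broadcast an invalidate, so two simultaneous writers are serialized instead of silently invalidating each other.

#include <mutex>

enum class State { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    State state = State::Shared;
    int   data  = 0;
};

struct Bus {
    std::mutex arbitration;   // stands in for the shared-bus arbiter
    CacheLine* lines[2];      // the two cores' copies of the same line
};

// Core `self` wants to write: it must win arbitration first. The winner
// invalidates the other copy *before* moving its own line to Modified, so the
// second writer's upgrade is ordered after the first write. (In real MESI the
// Modified data would be written back or forwarded at this point; the sketch
// omits that.)
void write_line(Bus& bus, int self, int value) {
    std::lock_guard<std::mutex> own_the_bus(bus.arbitration);  // win arbitration
    int other = 1 - self;
    bus.lines[other]->state = State::Invalid;   // snooped invalidate
    bus.lines[self]->state  = State::Modified;
    bus.lines[self]->data   = value;
}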

Related

Data races with MESI optimization

I don't really understand what exactly is causing the problem in this example.
Here is a snippet from my book:
Based on the discussion of the MESI protocol in the preceding section, it would
seem that the problem of data sharing between L1 caches in a multicore machine
has been solved in a watertight way. How, then, can the memory ordering
bugs we’ve hinted at actually happen?
There’s a one-word answer to that question: Optimization. On most hardware,
the MESI protocol is highly optimized to minimize latency. This means
that some operations aren’t actually performed immediately when messages
are received over the ICB. Instead, they are deferred to save time. As with
compiler optimizations and CPU out-of-order execution optimizations, MESI
optimizations are carefully crafted so as to be undetectable by a single thread.
But, as you might expect, concurrent programs once again get the raw end of
this deal.
For example, our producer (running on Core 1) writes 42 into g_data and
then immediately writes 1 into g_ready. Under certain circumstances, optimizations
in the MESI protocol can cause the new value of g_ready to become
visible to other cores within the cache coherency domain before the updated
value of g_data becomes visible. This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet. This means that the consumer (on Core 2) can potentially
see a value of 1 for g_ready before it sees a value of 42 in g_data, resulting in
a data race bug.
Here is the code:
int32_t g_data = 0;
int32_t g_ready = 0;

void ProducerThread() // running on Core 1
{
    g_data = 42;
    // assume no instruction reordering across this line
    g_ready = 1;
}

void ConsumerThread() // running on Core 2
{
    while (!g_ready)
        PAUSE();
    // assume no instruction reordering across this line
    ASSERT(g_data == 42);
}
How can g_data be computed but not present in the cache?
This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet.
If g_data is not in the cache, then why does the previous sentence end with "yet"? Would the CPU load the cache line containing g_data after it has been computed?
If we read this sentence:
This means that some operations aren’t actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time.
Then what operation is deferred in our example with producer and consumer threads?
So basically I don't understand how, under the MESI protocol, some operations become visible to other cores in the wrong order, despite being performed in the right order by a specific core.
PS:
This example is from the book "Game Engine Architecture, Third Edition" by Jason Gregory, on page 309. Here is the book
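For reference, here is a minimal sketch (my addition, not from the book) of how this race is normally closed in C++: publishing g_ready with a release store and reading it with an acquire load forces the g_data write to become visible to the consumer before the flag does, regardless of how the coherence traffic is optimized.

#include <atomic>
#include <cassert>
#include <cstdint>

int32_t              g_data  = 0;
std::atomic<int32_t> g_ready{0};

void ProducerThread()  // running on Core 1
{
    g_data = 42;
    // release: every write before this line is visible once g_ready == 1 is seen
    g_ready.store(1, std::memory_order_release);
}

void ConsumerThread()  // running on Core 2
{
    // acquire: once we observe g_ready == 1, we are guaranteed to see g_data == 42
    while (!g_ready.load(std::memory_order_acquire))
        ;  // spin (the book's PAUSE() hint is omitted here)
    assert(g_data == 42);
}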

Data Oriented Design with Mike Acton - Are 'loops per cache line' calculations right?

I've watched Mike Acton's talks about DOD a few times now to better understand it (it is not an easy subject for me). I'm referring to CppCon 2014: Mike Acton "Data-Oriented Design and C++"
and GDC 2015: How to Write Code the Compiler Can Actually Optimize.
But in both talks he presents some calculations that I'm confused with:
This shows that FooUpdateIn takes 12 bytes, so if you stack 32 of them you get 6 fully packed cache lines. The same goes for FooUpdateOut: it takes 4 bytes, and 32 of them give you 2 fully packed cache lines.
In the UpdateFoos function you can do ~5.33 loop iterations per cache line (assuming that count is indeed 32); he then assumes that all the math takes about 40 cycles, which means that each cache line takes about 213.33 cycles.
Now here's where I'm confused: isn't he forgetting about the reads and writes? Even though he has two fully packed data structures, they are in different memory locations.
In my head this is what's happening:
1. Read in[0].m_Velocity[0] (which would take about 200 cycles based on his previous slides).
2. Since in[0].m_Velocity[1] and in[0].m_Foo are in the same cache line as in[0].m_Velocity[0], their access is free.
3. Do all the calculation.
4. Write the result to out[0].m_Foo. Here is where I don't know what happens; I assume it would discard the previous cache line (fetched in 1.) and load a new one to write the result.
5. Read in[1].m_Velocity[0], which would discard yet another cache line (fetched in 4.) and again take about 200 cycles.
...
So, jumping between in and out, the calculation goes from ~5.33 loops/cache line down to 0.5 loops/cache line, which would be only 20 cycles per cache line.
Could someone explain why he wasn't concerned about reads/writes? Or what is wrong in my thinking?
Thank you.
If we assume the L1 cache is 64 KB and one cache line is 64 bytes, then there are 1024 cache lines in total. So in step 4 the write to out[0].m_Foo will not discard the cache line fetched in step 1, as the two are at different memory locations and occupy different cache lines. This is also the reason he uses a separate structure for the output m_Foo instead of mutating it in place as in his first implementation.
He is only counting the cost up to the point of computing the value; updating/writing the value has the same cost as in his first implementation. Also, the processor can optimize such loops quite well, since it can run several iterations in parallel (they are not forced to be sequential, as the results of the first and second iterations do not depend on each other). I hope this helps.
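For concreteness, here is a rough reconstruction of the layout being discussed (only the sizes of 12 and 4 bytes and the member names from the question are taken as given; the field types and the body of the math are my guesses), with the cache-line arithmetic written out as comments:

#include <cstddef>

struct FooUpdateIn {
    float m_Velocity[2];  // 8 bytes
    float m_Foo;          // 4 bytes -> 12 bytes per element
};

struct FooUpdateOut {
    float m_Foo;          // 4 bytes per element
};

// With 64-byte cache lines and count == 32:
//   32 * sizeof(FooUpdateIn)  = 384 bytes = 6 fully packed lines -> 64 / 12 ≈ 5.33 elements per input line
//   32 * sizeof(FooUpdateOut) = 128 bytes = 2 fully packed lines
// The in[] and out[] arrays occupy different cache lines, so writing out[i]
// does not evict the in[] line currently being read.
void UpdateFoos(const FooUpdateIn* in, size_t count, FooUpdateOut* out, float f)
{
    for (size_t i = 0; i < count; ++i) {
        // Stand-in for the ~40 cycles of math estimated in the talk.
        out[i].m_Foo = in[i].m_Foo + in[i].m_Velocity[0] * in[i].m_Velocity[1] * f;
    }
}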

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you first write the contents of a full cache line to a previously uncached address, and then soon after read from that address again, can the CPU avoid having to read the old contents of that address from memory?
Effectively it shouldn't matter what the contents of the memory were previously, because a full cache line's worth of data was completely overwritten. I can understand that a partial cache-line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory, etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the CPU's store buffer. Since the associated memory isn't currently cached, these entries are going to linger for some time while an RFO request pulls that line into the cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10 and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, many other cases would also fail to forward. For example, a smaller read that was fully contained in an earlier store would often fail: given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write, but often would not forward because the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above forward successfully on modern CPUs. The gory details are well covered, with pretty pictures, on stuffedcow, and Agner also covers it well in his microarchitecture guide.
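To make the overlap cases concrete, here is a hypothetical illustration (the offsets are arbitrary; it simply restates the rules above in code):

#include <cstdint>
#include <cstring>

void forwarding_cases(uint8_t* buf)  // buf points at ordinary write-back memory
{
    uint32_t v = 0x11223344, r;
    uint16_t half;

    std::memcpy(buf + 10, &v, 4);    // 4-byte store at offset 10

    std::memcpy(&r, buf + 10, 4);    // same address, same size: forwards from the store buffer
    std::memcpy(&half, buf + 12, 2); // smaller read, fully contained: forwards on
                                     //   recent CPUs (historically this could stall)
    std::memcpy(&r, buf + 9, 4);     // partial overlap: the byte at offset 9 is not
                                     //   covered by the store, so forwarding fails and
                                     //   the load waits for the store to complete
}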
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
extra.
The last case, where the read is bigger than the write, is definitely one where store forwarding stalls. The quote of 11 cycles probably applies to the case where all of the involved bytes are in L1; in the case where some bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.
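As a hedged sketch of that full-line, write-combining variant (the intrinsics are from <immintrin.h>; whether the RFO is actually avoided depends on the microarchitecture):

#include <immintrin.h>

// Fill one 64-byte cache line with non-temporal (streaming) stores.
// dst must be 64-byte aligned so both 32-byte stores land in the same line.
void fill_line_nt(void* dst)
{
    const __m256i zero = _mm256_setzero_si256();
    _mm256_stream_si256(static_cast<__m256i*>(dst),     zero);  // bytes  0..31
    _mm256_stream_si256(static_cast<__m256i*>(dst) + 1, zero);  // bytes 32..63
    _mm_sfence();  // order the streaming stores before any later use of the data
}

A later load of that line will then typically miss the cache and refetch it, which is consistent with the point above that NT stores probably don't forward to subsequent loads.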

CPU cache hit time

I am reading "Computer Organization and Design: The Hardware/Software Interface" (Spanish edition), and I have run into an exercise that I cannot solve. The exercise is about the memory hierarchy, specifically caches.
The exercise says:
Suppose 2.5 ns are required to access the labels in an N-way associative cache, 4 ns to access the data, 1 ns for the hit/miss comparison, and 1 ns to return the selected data to the processor in the case of a hit.
Is the critical path in a cache hit given by the time to determine whether there was a hit, or by the data access time?
What is the hit latency of the cache (the successful case)?
What would the hit latency be if both the label access time and the data array access time were 3 ns?
I'll try to answer the questions with all I know about memories.
To access data stored in the cache, the first thing I have to do is find the line using the index field of the address. Once the memory system has found the line, I need to compare the label field of my address with the label field in the cache. If they match, it is a hit, and I have to select the data within the line using the offset field of the address and return it to the processor.
That implies the cache would take 8.5 ns. But I have been thinking of another way caches could do it: once I have the desired line (2.5 ns), I can access the data and, in parallel, evaluate the equality condition. Then the time would be 4.5 ns. So one of these is the answer to the second question, but which of the results is correct?
For the first question, the critical path is the operation that takes the larger amount of time: if the cache takes 4.5 ns, then the critical path is accessing the labels in the cache, the comparison, and returning the data. Otherwise, it is the entire process.
For the last question, if the critical path is the entire process then it will take 8 ns; otherwise it will take 5 ns (label access, comparison, return the data).
Is this correct? And what about a fully associative cache, or a direct-mapped cache?
The problem is that I do not know what the cache does first, what it does next, and what it does in parallel.
If the text doesn't say anything about whether it is a cache in a uniprocessor or multiprocessor system, or about what the cache does in parallel, you can safely assume that it performs the whole process sequentially in the case of a cache hit. Instinctively, I think it doesn't make sense to access the data and do the hit/miss comparison in parallel: what if it's a miss? Then the data access is unnecessary and you increase the latency of the cache miss.
So then you get the following sequence in case of a cache hit:
Access label (2.5ns)
Compare hit/miss (1ns)
Access the data (4ns)
Return the data to the program requesting it (1ns)
Total: 2.5 + 1 + 4 +1 = 8.5ns
With this sequence we get (as you already know) the following answers to the questions:
Answer: The critical path in a cache hit is accessing the data and returning it, 4 + 1 = 5 ns, compared with determining whether the cache lookup was a hit, 2.5 + 1 = 3.5 ns.
Answer: 8.5ns
Answer: 3 + 1 + 3 + 1 = 8ns
Once I have the desired line (2.5 ns), I can access the data and, in parallel, evaluate the equality condition. Then the time would be 4.5 ns.
I don't see how you get 4.5 ns. If you assume that the data access and the hit/miss comparison are executed in parallel, then you get 2.5 + max(4, 1) + 1 = 7.5 ns in the case of a cache hit. A miss would still cost 2.5 + 4 = 6.5 ns, compared with a miss latency of only 2.5 + 1 = 3.5 ns if you don't access the data array until you know whether it's a hit, which makes it very ineffective to parallelize the hit/miss comparison with the data access.
If you assume that the access of the label and the hit/miss comparison is done in parallel with data access you get: 4 + 1 = 5ns in case of a cache hit.
Obviously you cannot return the data in parallel with fetching it, but if you imagine that were possible, and you access the label, do the comparison, and return the data in parallel with accessing the data, then you get 2.5 + 1 + 1 = 4.5 ns.
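To keep the variants straight, here is the same arithmetic written out once (the numbers come from the exercise; the overlap models are the ones discussed above):

#include <algorithm>

// Latencies from the exercise, in nanoseconds.
constexpr double label_ns = 2.5, cmp_ns = 1.0, data_ns = 4.0, ret_ns = 1.0;

// Fully serial hit: label -> compare -> data -> return.
constexpr double serial_hit      = label_ns + cmp_ns + data_ns + ret_ns;            // 8.5 ns

// Data access overlapped with the comparison (both start after the label read).
constexpr double overlap_compare = label_ns + std::max(cmp_ns, data_ns) + ret_ns;   // 7.5 ns

// Data access overlapped with the whole label + compare path.
constexpr double overlap_lookup  = std::max(label_ns + cmp_ns, data_ns) + ret_ns;   // 5.0 ns

// Hypothetical case where even the return overlaps the data access.
constexpr double overlap_return  = std::max(label_ns + cmp_ns + ret_ns, data_ns);   // 4.5 ns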
what about a fully associative cache, or a direct-mapped cache?
An N-way set-associative cache (which is what the question refers to) generalizes towards a fully associative cache. In a fully associative cache, blocks can be placed anywhere in the cache. That is very flexible, but it also means that when we want to look up a memory address we need to compare the tag with every block in the cache to know whether the address we're looking for is cached, so lookups are slower. In an N-way set-associative cache, only the N blocks of the set selected by the index need to be compared.
In a direct-mapped cache, every cache block can go in only one spot in the cache. That spot is computed from the index part of the memory address. Thus a direct-mapped cache gives very quick lookups but is not very flexible; depending on the cache size, cache blocks may be replaced very often.
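As a small illustrative sketch (the cache parameters are invented, not from the exercise) of how a direct-mapped cache splits an address into offset, index, and label/tag:

#include <cstdint>

// Hypothetical 32 KiB direct-mapped cache with 64-byte lines: 512 lines total.
constexpr uint64_t kLineSize = 64;                      // 6 offset bits
constexpr uint64_t kNumLines = 32 * 1024 / kLineSize;   // 512 lines -> 9 index bits

struct CacheLookup {
    uint64_t offset;  // byte within the line
    uint64_t index;   // which line (the one "spot") the block must go in
    uint64_t tag;     // compared against the stored tag ("label") to detect a hit
};

CacheLookup split_address(uint64_t addr)
{
    return {
        addr % kLineSize,                 // offset
        (addr / kLineSize) % kNumLines,   // index
        addr / (kLineSize * kNumLines),   // tag
    };
}

In a set-associative cache the index selects a set and the tag is compared against every way of that set; in a fully associative cache there are no index bits at all, so the tag must be compared against every line.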
The terminology in the question is a bit confusing: "label" is usually called "tag" when speaking of CPU caches.

The last level cache replacement policy

I've found a blog article about Intel Ivy Bridge's cache replacement policy. The author concluded that Ivy Bridge's L3 cache replacement policy is no longer pseudo-LRU.
Under the new cache replacement policy, suppose there are 4 sets in the L3, sets 0 and 1 are being used by one process, and sets 2 and 3 are available to allocate. If a new process on another CPU tries to load two pages into the L3 cache, is it guaranteed that the new process loads its pages into sets 2 and 3? In other words, if there are available cache sets in the last-level cache, does the hardware always choose the available sets for new pages?

Resources