Sequential program cache consistency - caching

I wanted to ask a question about cache consistency.
If I have a sequential program, I shouldn't have cache consistency problems, because the instructions are executed sequentially and consequently there is no danger that several processors will write the same memory location at the same time, in the case of shared memory.
The situation is different if I have a parallel program: it runs on multiple processors, so there is a high probability of cache consistency problems.
Is that right?

A single-threaded program doesn't migrate between cores on its own, unless the OS moves it (and when it does, the thread's state is reloaded from memory into the new core's cache, so there is no coherence problem there).
In a multi-threaded program, an update to a variable that is also present in other caches needs to be made known to those caches somehow. This causes extra data movement through the other caches. It may not block the updating thread, but once another thread needs the updated value, the synchronization/locking takes a performance hit, especially when other variables at very close addresses are also being updated so that they end up in the same cache line (false sharing). That's why using 20-byte elements in an array of locks is worse than using 128-byte elements.
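As a rough sketch of that cache-line effect (the 128-byte padding and the lock-striping setup here are illustrative assumptions, not from the original post):

```cpp
#include <mutex>

// Unpadded: neighbouring mutexes share a cache line, so two threads locking
// *different* stripes still bounce the same line between their caches
// (false sharing).
std::mutex packed_locks[16];

// Padded: each lock occupies its own (assumed) 128-byte cache line, so
// locking different stripes touches different lines and causes no extra
// coherence traffic.
struct alignas(128) PaddedMutex {
    std::mutex m;
};
PaddedMutex striped_locks[16];
```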
If CPUs did not have coherence, multi-threading wouldn't work efficiently. So, in some designs, an update is broadcast to all caches (as in a snooping cache). But this is not efficient with a high number of cores: if 1000 cores existed on the same CPU, it would require 1000-way broadcast logic consuming a lot of circuit area. So designers break the problem into smaller parts and add other schemes such as directory-based coherence and multiple clusters of cores, but this adds more latency to coherence.
On the other hand, many GPUs do not implement automatic cache coherence because
the algorithm given by the developer is generally embarrassingly parallel, with only a few points of synchronization, and multiple blocks of threads do not need to communicate with other blocks (when they do, they go through a common cache by the developer's explicit choice of instructions anyway);
there are thousands of streaming pipelines (not full cores) that just need to issue memory requests efficiently, or there would not be enough die area for that many pipelines;
high throughput is required rather than low latency (so there is no need for implicit coherence anywhere);
so the multiprocessors in a GPU are designed to do work that is completely independent of the others, and adding automatic coherence would add little performance (if it did not subtract from it). When the developer needs to synchronize data between multiple threads of the same block in a GPU, there are instructions for this, and without them updates are not guaranteed to become visible to other threads. So cache coherence in a GPU is effectively opt-in.
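For illustration, a minimal CUDA sketch of that opt-in, block-local synchronization (a block-wide sum; the 256-thread block size and the kernel name are assumptions):

```cuda
__global__ void block_sum(const float *in, float *out) {
    __shared__ float tile[256];                 // scratch memory shared by one block only
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // make every thread's write visible to the block

    // Tree reduction inside the block; each step needs another barrier,
    // because nothing keeps the threads' views consistent automatically.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];    // one value per block written to global memory
}
```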

Related

Can CPU Out-of-Order-Execution cause memory reordering?

I know that store buffers and invalidate queues are causes of memory reordering. What I don't know is whether Out-of-Order Execution can cause memory reordering too.
In my opinion, Out-of-Order Execution can't cause reordering, because the results are always retired in order, as mentioned in this question.
To make my question clearer, let's say we have a relaxed memory consistency architecture like this:
It doesn't have a store buffer or invalidate queues
It can do Out-of-Order Execution
Can memory reordering still happen in this architecture?
Does a memory barrier have two functions: one forbidding out-of-order execution, the other flushing the invalidate queue and draining the store buffer?
Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.
It is not so much a question of the loads being retired in order as of when the load value is bound to the load instruction. E.g., Load1 may precede Load2 in program order, Load2 gets its value from memory before Load1 does, and if there is an intervening store to the location read by Load2, then load/load reordering has occurred.
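A litmus test for exactly that situation, sketched with C++ relaxed atomics (the variable and function names are illustrative; run in a loop on hardware that reorders loads, the outcome r1 == 1 && r2 == 0 is allowed):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0}, flag{0};
int r1, r2;

void writer() {
    data.store(1, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);   // orders the two stores on the writer's side
}

void reader() {
    r1 = flag.load(std::memory_order_relaxed);  // Load1 in program order
    r2 = data.load(std::memory_order_relaxed);  // Load2
    // If the core binds Load2's value before Load1's, the writer's stores can
    // land in between and r1 == 1 && r2 == 0 is observed: load/load reordering.
    // Making Load1 an acquire load (or adding a barrier between them) forbids it.
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```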
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a store to the address of such a buffered but not yet retired load is detected, then the load and the subsequent instructions in program order are "nuked", and execution is resumed at, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop based cache coherence. Many systems, including Intel systems that have such mechanisms also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said "buffer until retirement", but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have themselves been bound, and there is no longer any possibility of an intervening store being observed even transitively.
This can be important, because there is quite a lot of performance to be gained by "early retirement", removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out-of-order hardware mechanisms.

Memory models-/Cache coherence protocols: How TSO goes together with MESIF

Having just worked through my system's programming lecture material, I stumbled upon the crucial concepts of memory models as well as cache coherence protocols. Although they make sense as independent concepts, it is not really clear how they go together. Specifically, when looking at x86, I am working with an ISA enforcing the TSO memory model, and a CPU (in the case of Intel) using the MESIF cache coherence protocol.
In the beginning, the professor introduced cache coherence protocols as means of ensuring that to any core in the chip, it appears as if they all access one large, monolithic block of memory. Then, after wrapping up with cache coherence, he continued with memory models, specifically TSO (we were introduced to linearizability-/sequential consistency in our parallel programming class already). The following is a direct quote from the lecture material about the x86 memory model:
Standard for 64-bit x86 processors
Sometimes called Total Store Ordering (TSO)
Earlier 32-bit x86 implemented PRAM – weaker!
Write-to-read relaxation: later reads can bypass earlier writes
All processors see writes from one processor in the order they were issued.
Processors can see different interleavings of writes from different processors.
It seems as if we "solved" the problem of slow sequential consistency by introducing what is (yet another) layer in the cache hierarchy, namely the (ordered) store buffer.
To me, TSO seems orthogonal to the principles of cache coherence. We worked so hard to get our caches to match, only to add another layer in between not covered by cache coherence.
Questions:
Why are the store buffers not covered by cache coherence protocols? Is it assumed that writeback to L1 from store buffers is so fast, that inconsistencies due to intermediate writebacks would not be an issue in the majority of cases? (i.e. I reckon store buffer -> L1 transfer only takes a few cycles, so as soon as L1 receives the data, it sends a transaction across the bus, telling other cores to invalidate their copy)
How should I think of the two concepts of cache coherence as well as memory model? The way I understand it, memory models are the theoretical concepts of what we want, and cache coherence is part of the practical implementation to achieve said model.
Thank you so much in advance for the clarifications!
Best,
Felix
The sequential consistency model is the most commonly prescribed memory model for shared-memory parallel programming. For a parallel program comprising multiple tasks or threads, sequential consistency requires the two conditions described below.
Program order execution: All memory operations in each task appear to execute in that task's program order.
Memory access atomicity: Memory operations (in all tasks of the parallel program) appear to execute one-at-a-time.
Every programmer assumes these conditions to reason about their parallel programs.
Unfortunately, sequential consistency is a less useful model than imagined. The main reason is the implementation cost of these two properties: enforcing them prohibits many basic compiler and hardware optimizations [1].
Other weak/relaxed memory models have been proposed that relax these properties and allow such compiler and hardware optimizations. These weak memory models trade programmability for performance.
Why are the store buffers not covered by cache coherence protocols?
It is a design choice in TSO for performance reasons. Serving a load from the store buffer, or serving a load while a preceding store (to a different address) is still in the store buffer, reduces the store latency. To keep store buffers coherent, the load would have to wait until all other processors have acknowledged receipt of the invalidates generated by the store. Moreover, most of the time there may not be any copies of the stored address in other caches (the variable is local to a task), so waiting for the acknowledgements is a waste of time. If other tasks do share the variable, waiting for the acknowledgements can be enforced explicitly using atomic or fence instructions.
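The classic store-buffer litmus test makes this concrete (a C++ sketch; the names are illustrative). Each thread can read its own store out of the store buffer before the other core has seen it, so without the commented-out fences both loads may return 0; the fences force the store buffer to drain (i.e., wait for the acknowledgements) first:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_release);                   // sits in thread 1's store buffer
    // std::atomic_thread_fence(std::memory_order_seq_cst);  // would wait for the invalidates
    r1 = y.load(std::memory_order_acquire);                  // may still read 0
}

void thread2() {
    y.store(1, std::memory_order_release);
    // std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // Under TSO (and under C++ release/acquire), r1 == 0 && r2 == 0 is allowed;
    // with the seq_cst fences it is not.
}
```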
How should I think of the two concepts of cache coherence as well as memory model?
Cache coherence protocols are concerned with serializing stores to the same memory location and ensuring that a load returns the value of the most recent store to the same memory location. Cache coherence protocols are required only when there are caches or multiple copies of the same memory location, and its job is to keep all the copies coherent.
A memory consistency model is concerned with the relative order of loads and stores (of the same task) to different memory locations. Any system that involves executing shared-memory parallel programs (multiple tasks or threads communicating through a shared memory) must define its memory consistency model.
Broadly, cache coherence protocols implement a part of the memory consistency model. More precisely, it is the combination of the core pipeline and the cache coherence protocol (and every other component that a memory instruction traverses) that must adhere to the memory model specification.
[1]: Shared memory consistency models: a tutorial

How does a processor know about the latest copy of a cache line in a multiprocessor system?

In a multiprocessor system where each processor has its own cache, how does a processor come to know where to get a copy of the data from?
The data may be present in its own cache, in the caches of the other processors, or in main memory; how does it know which copy is the latest one?
Most processors (in particular the x86 processors in our laptops, desktops, and servers) have some hardware-provided cache coherence.
Often, some synchronizing memory barrier instructions exist.
It is rumored that some synchronizing machine instructions can be quite slow.
Actually, the recent C++11 and C11 standards have specific wording and atomic data types to deal with this, like C++11 std::atomic.
In practice, you should use some well-established standard library like pthreads (or C++11 std::thread, etc.).
In a typical modern cache coherent system, if the contents of a memory address are present in multiple caches, their content will be the same. Using the typical invalidation-based coherence mechanism, in order for a processor to change the content, it must gain exclusive ownership of that block of memory. This is done by invalidating any copies. Any subsequent request from a processor that previously had the block cached would result in a miss (the block was invalidated) and a coherence action will find the updated content in the writing processor's cache.
(In earlier implementations of cache coherence with write-through caches, a common bus to memory could be snooped to grab any content changes. Similarly, a processor changing content could broadcast or multicast the changes to any sharers. These methods would keep cached contents the same.)
A more subtle aspect of this process is memory consistency--how different processors see the orderings of memory accesses to different addresses. With sequential consistency all processors see a single ordering of every read and write in the system. This is the easiest consistency model to understand, but in order to support greater parallel operation hardware complexity increases (e.g., rather than waiting to confirm that no ordering conflicts exist, a processor can speculatively continue execution and rollback to a previous known-correct state if an ordering conflict occurred).
A relaxed consistency model allows reads and writes to have inconsistent orderings among different processors. To provide stronger ordering guarantees, memory barrier operations are provided. These operations guarantee that certain types of memory accesses later in program order for the processor performing the barrier operation will be observed by all other processors as occurring after the barrier and certain types of memory accesses (earlier for that processor) will be observed by all processors before the barrier.
A system using a relaxed consistency model could provide the same behavior as a sequential consistency model system by using memory barriers after every memory access. However, systems using a relaxed model will generally not handle such excessive use of barriers well since they are designed to exploit the relaxed demands on memory ordering.
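As a concrete sketch of that barrier pairing (C++ standalone fences standing in for the hardware barriers; the producer/consumer names and the payload value are illustrative):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<int> ready{0};

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // earlier write ordered before...
    ready.store(1, std::memory_order_relaxed);            // ...the flag becomes visible
}

void consumer() {
    while (ready.load(std::memory_order_relaxed) == 0) {}  // spin until the flag is seen
    std::atomic_thread_fence(std::memory_order_acquire);   // flag read ordered before...
    int v = payload;                                        // ...this read; v is guaranteed to be 42
    (void)v;
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}
```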

CUDA: When to use shared memory and when to rely on L1 caching?

After Compute Capability 2.0 (Fermi) was released, I've wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background?
Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?
To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1) and synchronize with __threadfence_block()? The latter option should be easier to implement, since it doesn't have to keep values in two different locations, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for the data to actually make it all the way out to global memory.
With shared memory, one is guaranteed that a value that was put there remains there for the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let L1 manage it based on the usage pattern that the algorithm actually has?
Two big reasons why automatic caching is less efficient than a manual scratch-pad memory (this applies to CPUs as well):
parallel accesses to random addresses are more efficient. Example: histogramming (see the sketch after this answer). Let's say you want to increment N bins, each more than 256 bytes apart. Then, due to coalescing rules, that results in N serialized reads/writes, since global memory and the caches are organized in large ~256-byte blocks. Shared memory doesn't have that problem.
Also, to access global memory you have to do virtual-to-physical address translation. Having a TLB that can do lots of translations in parallel would be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in parallel, and I believe this is the reason why.
it avoids writing dead values back to memory, which wastes bandwidth and power. Example: in an image-processing pipeline, you don't want your intermediate images to get flushed to memory.
Also, according to an NVIDIA employee, current L1 caches are write-through (they immediately write to the L2 cache), which will slow down your program.
So basically, the caches get in the way if you really want performance.
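The histogram case from the first point above might look roughly like this in CUDA (a sketch; NUM_BINS, the grid-stride loop, and 8-bit input data are assumptions):

```cuda
#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *global_bins) {
    __shared__ unsigned int bins[NUM_BINS];         // per-block scratch-pad copy of the bins
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    // The scattered increments hit shared memory, not large global-memory blocks.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);
    __syncthreads();

    // One pass per block merges the partial histogram into global memory.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&global_bins[i], bins[i]);
}
```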
As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU. So your comment that "This is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.
Data in the L1 cache isn't evicted when it isn't used often enough. Usually it is evicted when a request is made for a memory region that wasn't previously in the cache and whose address resolves to one that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory location can only be cached in a small subset of the entire cache, based on its address.
I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with the cache everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way, since you're the one who knows what input will be given and what each thread will do (to a certain extent, of course).
Caching data through several memory layers always needs to follow a cache-coherency protocol. There are several such protocols and the decision on which one is the most suitable is always a trade off.
You can have a look at some examples:
Related to GPUs
Generally for computing units
I don't want to go into many details, because it is a huge domain and I am not an expert. What I want to point out is that in a shared-memory system (here the term shared does not refer to the so-called shared memory of GPUs) where many compute units (CUs) need data concurrently, there is a memory protocol that attempts to keep the data close to the units so that they can fetch it as fast as possible. In the example of a GPU, when many threads in the same SM (streaming multiprocessor) access the same data, there should be coherency in the sense that if thread 1 reads a chunk of bytes from global memory and in the next cycle thread 2 is going to access this data, an efficient implementation would be such that thread 2 is aware that the data is already in the L1 cache and can access it fast. This is what the cache coherency protocol attempts to achieve: to let all compute units be up to date with what data exist in the L1 and L2 caches and so on.
However, keeping threads up to date, or in other words keeping them in coherent states, comes at some cost, which is essentially wasted cycles.
In CUDA, by declaring memory as shared rather than relying on the L1 cache, you free it from that coherency protocol. So access to that memory (which is physically the same piece of hardware) is direct and does not implicitly invoke the functionality of the coherency protocol.
I don't know exactly how fast this should be, and I didn't perform any such benchmark, but the idea is that since you no longer pay for this protocol, access should be faster!
Of course, the shared memory on NVIDIA GPUs is split into banks, and anyone who wants to use it for a performance improvement should have a look at this first. The reason is bank conflicts, which occur when two threads access the same bank and cause the accesses to be serialized..., but that's another thing (link).
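A common illustration of working around those bank conflicts is padding the shared array by one column (a sketch of a 32×32 tile transpose over a square width×width image; the padding width and the 32×32 thread block are assumptions):

```cuda
__global__ void transpose32(const float *in, float *out, int width) {
    // 33 columns instead of 32: the extra column shifts each row into a
    // different bank, so reading a column no longer hits one bank 32 times.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;   // swapped block indices for the transposed output
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```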

Why are separate icache and dcache needed? [duplicate]

This question already has an answer here:
What does a 'Split' cache means. And how is it useful(if it is)?
Can someone please explain what we gain by having separate instruction and data caches?
Any pointers to a good link explaining this would also be appreciated.
The main reason is: performance. Another reason is power consumption.
Separate dCache and iCache make it possible to fetch instructions and data in parallel.
Instructions and data have different access patterns.
Writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. For example, the AMD Software Optimization Guide for 10h and 12h Processors states that:
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and stored alongside the instruction cache.
The Intel Nehalem CPU features a loop buffer, and in addition to this the Sandy Bridge CPU features a µop cache (see The microarchitecture of Intel, AMD and VIA CPUs). Note that these are features related to code and have no direct counterpart in relation to data. They benefit performance, and since Intel "prohibits" CPU designers from introducing features which result in an excessive increase of power consumption, they presumably also benefit total power consumption.
Most CPUs feature a data forwarding network (store to load forwarding). There is no "store to load forwarding" in relation to code, simply because code is being modified much less frequently than data.
Code exhibits different patterns than data.
That said, most CPUs nowadays have unified L2 cache which holds both code and data. The reason for this is that having separate L2I and L2D caches would pointlessly consume the transistor budget while failing to deliver any measurable performance gains.
(Surely, the reason for having a separate iCache and dCache isn't reduced complexity, because if the reason were reduced complexity then there wouldn't be any pipelining in any of the current CPU designs. A CPU with pipelining is more complex than a CPU without pipelining. We want the increased complexity. The fact is: the next CPU design is (usually) more complex than the previous design.)
It has to do with which functional units of the CPU primarily access that cache. Since the ALU and FPU access the data cache while the decoder and scheduler access the instruction cache, and pipelining often allows the instruction-fetch stages and the execution unit to work simultaneously, using a single cache would cause contention between these two components. By separating them we lose some flexibility and gain the ability for these two major components of the processor to fetch data from cache simultaneously.
One reason is reduced complexity - you can implement a shared cache that can retrieve multiple lines at once, or just asynchronously (see Hit-Under-Miss), but it makes the cache controller far more complicated.
Another reason is execution stability - if you have a known amount of icache and dcache, caching of data cannot starve the cache system of instructions, which may occur in a simplistic shared cache.
And as Dan stated, having them separated makes pipelining easier, without adding to the controller complexity.
Since the processor's MEM and FETCH stages can access the L1 cache (assume it is combined) simultaneously, there can be a conflict over which one to give priority to (this can become a performance bottleneck). One way to resolve this is to give the L1 cache two read ports, but increasing the number of ports increases the cache area quadratically and hence the power consumption.
Also, if the L1 cache is combined, there is a chance that some data blocks will replace blocks containing instructions which were important and about to be accessed. These evictions and the cache misses that follow can hurt overall performance.
Also, most of the time the processor fetches instructions sequentially (with a few exceptions such as taken branch targets and jumps), which gives the instruction cache good spatial locality and hence a good hit rate. And, as mentioned in other answers, there are hardly any writes to the iCache (self-modifying code such as JIT compilers being the exception). So separate icache and dcache designs can be optimized for their respective access patterns and for the other components involved, like load/store queues, write buffers, etc.
There are generally two kinds of architectures: 1. the von Neumann architecture and 2. the Harvard architecture. The Harvard architecture uses two separate memories. You can read more on this ARM page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3839.html

Resources