How could cache improve the performance of a pipeline processor? - performance

I understand that accessing cache is much faster than accessing the main memory and I have a basic idea of all those miss rate and miss penalty stuff.
But this just came across my mind : how could cache be useful in a pipeline processor?
From my understanding, the time a single clock cycle takes is lower bounded by the longest time taken among all the processes. Like if accessing cache takes 1n, accessing main memory takes 10n, then the clock cycle time should be at least greater than 10n. Otherwise that task could not be completed when needed.. Then even the cache accessing is completed, the instruction still have to wait there until next clock cycle.
I was imaging a basic 5 stage pipeline process which includes instruction fetching, decoding, execution, memory accessing and write back.
Am I completely misunderstanding something? Or maybe in reality we have a much complex pipeline, where memory accessing is broken down to several pieces like cache checking and main memory accessing so that if we get an hit we can somehow skip the next cycle? But there will be a problem too if previous instruction didn't skip a cycle while the current instruction does...
I am scratching my head off... Any explanation would be highly appreciated!

The cycle time is not lower bounded by the longest time take among all processes.
Actually, a RAM access can take hundreds of cycles.
There are different processor architectures, but typical numbers might be:
1 cycle to access a register.
4 cycles to access L1 cache.
10 cycles to access L2 cache.
75 cycles to access L3 cache.
hundreds of cycles to access main memory.
In the extreme case, if a computation is memory-intensive, and constantly missing the cache, the CPU will be very under-utilized, as it requests to fetch data from memory and waits until the data is available. On the other hand, if an algorithm needs to repeatedly access the same region of memory that fits entirely in L1 cache (like inverting a matrix that is not too big), the CPU will be much better utilized: The algorithm will start by fetching the data into cache, and the rest of the algorithm will just use the cache to read and write data values.

Related

which is optimal a bigger block cache size or a smaller one?

Given a cache size with constant capacity and associativity, for a given code to determine average of array elements, would a cache with higher block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for(j=0; j < k; j++) {
sub_total = 0; /* Nested loops to avoid overflow */
for(i=0; i < N; i++) {
sub_total += A[jN + i];
}
total += sub_total/N;
}
average = total/k;
Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (powerpoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size showing a curve, and also breaking it down into miss penalty vs. miss rate (for a simple model I think, for a simple in-order CPU that sucks at hiding memory latency. e.g. maybe not even pipelining multiple independent misses. (miss under miss))
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order), and even cache-blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not like modern out-of-order CPUs that can overlap some computation with cache-miss latency so more shorter misses is different from fewer longer misses.
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
If your cache-miss handling has an early restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says theoretically possible to not have it in a CPU design). The load that caused the miss gets its data as soon as it arrives. The rest of the cache line fill happens "in the background", and hopefully later loads can also be satisfied from the partially-received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
The simplistic answer is that larger cache blocks would be preferred since the workload has no (data) temporal locality (no data reuse), perfect spacial locality (excluding the potentially inadequate alignment of the array for the first block and insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider the size and alignment of the array (the fraction of the first and last cache blocks that will be unused and what fraction of the memory transfer time that represents; for a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth), the ability of the system to use critical word first (if the array is of modest size and there is no support for early use of data as it becomes available rather than waiting for the entire block to be filled, then the start-up overhead will remove much of the prefetching advantage of larger cache blocks), the use of prefetching (software or hardware prefetching reduces the benefit of large cache blocks and this workload is extremely friendly to prefetching), the configuration of the memory system (e.g., using DRAM with an immediate page close controller policy would increase the benefit of larger cache blocks because each access would involve a row activate and row close, often to the same DRAM bank preventing latency overlap), whether the same block size is used for instructions and page table accesses and whether these accesses share the cache (instruction accesses provide a second "stream" which can introduce conflict misses; with shared caching of a two-level hierarchical page table TLB misses would access two cache blocks), whether simple way prediction is used (a larger block would increase prediction accuracy reducing misprediction overhead), and perhaps other factors.
From your example code we can't say either way as long as the hardware pre-fetcher can keep up a memory stream at maximum memory throughput.
In a random access scenario a shorter cache-line might be preferable as you then don't need to fill all the line. But the total amount of cached memory would go down as you need more circuits for tags and potentially more time for comparing.
So a compromise must be made Intel has chosen 64-bytes per line (and fetches 2 lines) others has chosen 32-bytes per line.

CPU bound vs Cache bound - Can instructions be executed without cache/memory access? Can memory access be as fast as instruction execution?

I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers that explain that there are other variants like Memory Bound, Cache bound, etc.
I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.
However, the difference between CPU Bound programs and IO Bound programs doesn't seem as clear. Here is what I gathered :
Cache bound - Speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most visited part of a program is a small chunk of code inside a loop small enough to be contained within the cache, then the program may be cache bound.
CPU bound - The speed at which CPU executes instructions is an important factor in deciding the speed at which the program gets executed.
But how can processes be CPU bound? I mean, instructions need to be fetched before execution (from cache/ Main Memory) every time, so, no matter how fast the CPU is, it will have to wait for the cache to finish data transfer and thus will at least be Cache Bound or Memory bound, since memory access is slower than instruction execution.
So is CPU bound the same as cache bound?
CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow, while others will remain empty - both cases are bad utilization, but you need to find the jam to release everything.
Similarly, with a CPU you have multiple systems that need to work in unison to make the program progress. Each of these machines has an upper limit on the bandwidth it can work, and when it's reached - it will become a limitation, making the other systems underutilized or even stalled.
The main memory for example depends on the number of channels and the type of DRAM (and of course frequency), but let's say it commonly peaks at 25G/s in client CPUs. that means that any workload that tries to consume data beyond this rate, will become blocked by the memory BW (i.e. memory bound), and the rest of the systems will be underutilized.
Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.
According to 2.1.3 here, Intel Skylake for example provides 2 32B loads + 1 store per cycle from the L1 (though the actual utilization they quote is a little lower, probably due to collisions or writeback interference), L2 is effectively about 1/2 line per cycle and L3 a little less than 1/3. This means that if your data set is contained in one of these levels, you can reach that peak BW before being capped by that cache.
On the other hand, let's say you don't reach the peak cache bandwidth, instead consuming data from the L1 at a lower rate, but each element of data requires many complicated mathematical operations. In that case, you may be bounded by your execution bandwidth - more so if these operations are limited to only part of the execution ports (as is the case with some esoteric operations).
There are useful tools to determine what you're bounded by - look up TopDown analysis for example

Implementing LRU with timestamp: How expensive is memory store and load?

I'm talking about LRU memory page replacement algorithm implement in C, NOT in Java or C++.
According to the OS course notes:
OK, so how do we actually implement a LRU? Idea 1): mark everything we touch with a timestamp.
Whenever we need to evict a page, we select the oldest page (=least-recently used). It turns out that this
simple idea is not so good. Why? Because for every memory load, we would have to read contents of the
clock and perform a memory store! So it is clear that keeping timestamps would make the computer at
least twice as slow. I
Memory load and store operation should be very fast. Is it really necessary to get rid of these little tiny operations?
In the case of memory replacement, the overhead of loading page from disk should be a lot more significant than memory operations. Why would actually care about memory store and load?
If what the notes said isn't correct, then what is the real problem with implementing LRU with timestamp?
EDIT:
As I dig deeper, the reason I can think of is like the following. These memory store and load operations happen when there is a page hit. In this case, we are not loading page from disks, so the comparison is not valid.
Since the hit rate is expected to be very high, so updating the data structure associated with LRU should be very frequent. That's why we care about the operations repeated in the udpate process, e.g., memory load and store.
But still, I'm not convincing how significant the overhead is to do memory load and store. There should be some measurements around. Can someone point me to them? Thanks!
Memory load and store operations can be quite fast, but in most real life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.
Rough numbers for memory access times:
L1 cache hit: 2-4 CPU cycles
L2 cache hit: 10-20 CPU cycles
L3 cache hit: 50 CPU cycles
Main memory access: 100-200 CPU cycles
So it costs real time to do loads and stores. With LRU, every regular memory access will also incur the cost of a memory store operation. This alone doubles the number of memory accesses the CPU does. In most situations this will slow the program execution. In addition, on a page eviction all the timestamps will need to be read. This will be quite slow.
In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.
In short - LRU is quite expensive.

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this but I've searched and always found different answer. My question is:
Which is faster? Cache or CPU Registers?
According to me, the registers are what directly load data to execute it while the cache is just a storage place close or internally in the CPU.
Here are the sources I found that confuses me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
CPU register is always faster than the L1 cache. It is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower. Because if you make something bigger then inevitably the distances are going to get larger. Something that was automatic for a while, shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Writeback caches have been prominent in desktop chip design since about 2001, but was widely missing from the ARM-based mobile chip markets until much more recently.
x86 chips these days have store forwarding from the write-back cache. If you store an address to the WB cache and then read back the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle. But in fact, the L1 isn't be referenced at all in that case. Store forwarding also has some other rules for it to work properly, which also vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
The store forwarding feature can generate false positives where-in the CPU thinks the address is in the writeback buffer based on a fast partial-bits check (usually 10-14 bits, depending on chip). It uses an extra cycle to verify with a full check. If that fails then the CPU must re-route as a regular memory request. This miss can add an extra 1-2 cycles latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over-time is about 10-15% higher than its documented 3-cycles. It is almost a non-factor on Intel's Core series.
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
And then manually graph info from that with the tables on architectures, models, and release dates from the various List of CPUs pages on wikipedia.

Instruction cache for pipelined simulator

I am trying to complete a simulator based for a simplified mips computer using java. I believe I have completed the pipeline logic needed for my assignment but I am having a hard time understanding what the instruction and data caches are supposed to do.
The instruction cache should be direct-mapped with 4 blocks and the block size is 4 words.
So I am really confused on what the cache is doing. Is it going to memory and pulling the instruction from memory? For example, in one block it will have just the add command.
Would it make sense to implement it as a 2 dimensional array?
First you should know the basics of the cache. You can imagine cache as an intermediate memory which sits between the DRAM or main memory and your processor, however very much limited in size. Now, when you try to access a location in memory, you will search it first in the cache. If it is found (cache hit) the processor will take this data and resume the execution. Generally the cache hit is supposed to be very few clock cycles lets say 1 or 2. Suppose if the data is not found in the cache (cache miss), then the data is fetched from the main memory, filled in the cache and fed to the processor. The processor blocks till the data is fetched. This takes few hundreds of clock cycles normally depending on the DRAM you are using. The amount of data that is fetched from DRAM is equal to cacheline size. For that you should search for spatial locality of reference in caches.
I think this should get you a start.

Resources