Is cache miss rate the only thing that matters when deciding which is better or performs better? - caching

If I have two different cache subsystem designs C1 and C2 that both have roughly the same hardware complexity, can I decide which one is the better choice, taking the effectiveness of the cache subsystem as the prime factor, i.e. that the number of misses should be minimized?
Given the total miss rates below:
miss_rate = (number of cache misses) / (number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss-rate information sufficient to decide which subsystem is better?

Yes, assuming hit latency is the same for both caches, the actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize to other workloads.
Differences in size, associativity, and eviction policy all matter because of their impact on miss rate for any given workload. Even cache line (block) size factors into this: a cache with twice as many 32-byte lines vs. one with half as many 64-byte lines can cache more scattered words, but pulls in less nearby data on a miss. (Unless you have hardware prefetching, but again prefetch algorithms ultimately just affect miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
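For example, plugging the two miss rates from the question into the standard AMAT (average memory access time) formula makes this concrete. The hit and miss costs below are made-up placeholders, since the question doesn't give any; the point is only that with those costs held equal, the lower miss rate wins:

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty, with assumed (not given) costs */
int main(void) {
    double hit_time = 2.0;        /* cycles, assumed equal for C1 and C2 */
    double miss_penalty = 100.0;  /* cycles, assumed equal for C1 and C2 */
    double miss_rate_c1 = 0.77, miss_rate_c2 = 0.73;

    printf("AMAT C1 = %.1f cycles\n", hit_time + miss_rate_c1 * miss_penalty);
    printf("AMAT C2 = %.1f cycles\n", hit_time + miss_rate_c2 * miss_penalty);
    /* prints 79.0 and 75.0: C2 wins as long as the costs really are equal */
    return 0;
}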
Well, not just latency, but the overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. That's what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance from just the instruction mix, hit/miss rates, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others (e.g. a miss on the critical path of a larger dependency chain vs. one that isn't). Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache-miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, it can usually hide store-miss latency.) So the miss penalty can differ depending on which loads miss, and whether software instruction scheduling was able to hide more or less of the latency for them. (If the independent work after a load includes other loads, you need a non-blocking cache that handles hit-under-miss. Miss-under-miss, to get memory-level parallelism with multiple misses in flight, also helps, as does being able to reach a hit after 2 or more cache-miss loads.)
I think that usually, for most workloads across different cache geometries and sizes, there won't be a significant bias towards more of the misses being easier or harder to hide, so you could still say that miss rate is the only thing that ultimately matters.
Miss rate for a cache depends on the workload, so you can't say that a lower miss rate on just one workload or trace makes a cache better on average or for all cases. E.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (one with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24 KiB array, the 32 KiB 2-way cache might have the higher hit rate.
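To make the geometry concrete: assuming 64-byte lines (an assumption, not something stated above), a 16 KiB 8-way cache has 32 sets while a 32 KiB 2-way cache has 256, so the same set of addresses can collide in one design and not the other. A rough sketch of the index calculation:

#include <stdio.h>
#include <stdint.h>

/* Which set an address maps to, for a cache of the given capacity,
 * line size and associativity (all values here are illustrative). */
static unsigned set_index(uintptr_t addr, unsigned capacity_bytes,
                          unsigned line_bytes, unsigned ways) {
    unsigned num_sets = capacity_bytes / (line_bytes * ways);
    return (unsigned)((addr / line_bytes) % num_sets);
}

int main(void) {
    uintptr_t addr = 0x12345;   /* arbitrary example address */
    printf("16 KiB 8-way (32 sets):  set %u\n", set_index(addr, 16 * 1024, 64, 8));
    printf("32 KiB 2-way (256 sets): set %u\n", set_index(addr, 32 * 1024, 64, 2));
    return 0;
}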

The term "better" is subjective as follows:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip.  (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific).  Thus, two different workloads may have different miss rates on each of the caches in question.  So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that lower miss rate is better, we would have to know that all the other factors are equal.  Otherwise, the notion of better needs to be defined, perhaps to include cost/benefit definition.

Related

What are the trade-offs of larger cache memories? Could we use one to replace secondary memory?

What are the disadvantages of using larger cache memories? Could we just use a large enough cache memory so that secondary memory wouldn't be needed at all? I understand that the most compelling arguments are related to its cost and its size. But if we assume that creating such a cache memory is possible, would it encounter any additional problems?
There would be many problems even if it were not expensive:
Size will degrade performance
A cache is fast because it's very small compared to main memory and hence requires only a small amount of time to search. If you build a large cache, it will not be able to perform at the same speed as its smaller counterpart.
Larger die area
Most DRAM chips require only a capacitor and a transistor to store a bit. SRAM, on the other hand, requires at least 6 transistors to make a single cell of memory, which takes more area.
High power requirements
Because of the extra transistors, SRAM requires more power to operate, which in turn generates more heat, so you will have to handle the cooling problem.
So, as you can see, it's not worth the effort, given that today's computers already achieve a hit ratio of 90% or more most of the time.

Which is optimal, a bigger cache block size or a smaller one?

Given a cache with constant capacity and associativity, for given code that computes the average of array elements, would a cache with a larger block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for (j = 0; j < k; j++) {
    sub_total = 0;  /* Nested loops to avoid overflow */
    for (i = 0; i < N; i++) {
        sub_total += A[j*N + i];
    }
    total += sub_total / N;
}
average = total / k;
Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (PowerPoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size showing a curve, and also break it down into miss penalty vs. miss rate (for a simple model, I think: a simple in-order CPU that sucks at hiding memory latency, e.g. maybe not even pipelining multiple independent misses (miss under miss)).
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order), and even cache-blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not like modern out-of-order CPUs that can overlap some computation with cache-miss latency so more shorter misses is different from fewer longer misses.
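Loop interchange itself is easy to show in a few lines. This is a generic sketch of the idea those slides describe, not code taken from them; the array dimensions are arbitrary:

#define ROWS 1024
#define COLS 1024

/* C stores arrays row-major, so the inner loop should walk the last index. */
double sum_column_major(const double a[ROWS][COLS]) {
    double s = 0.0;
    for (int j = 0; j < COLS; j++)        /* inner loop strides by COLS */
        for (int i = 0; i < ROWS; i++)    /* doubles: poor spatial locality */
            s += a[i][j];
    return s;
}

double sum_row_major(const double a[ROWS][COLS]) {  /* after interchange */
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)        /* inner loop is contiguous, so */
        for (int j = 0; j < COLS; j++)    /* roughly one miss per cache line */
            s += a[i][j];
    return s;
}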
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
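As a rough illustration of that per-line overhead, here is a sketch using the GCC/Clang __builtin_prefetch builtin; the prefetch distance and the 8-doubles-per-line figure are tuning assumptions, not values from this answer. With 128-byte lines the prefetch would fire half as often for the same array:

#define DOUBLES_PER_LINE 8   /* 64-byte line / 8-byte double (assumed) */
#define PREFETCH_DIST 8      /* lines ahead; purely illustrative */

double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        /* one software prefetch per cache line, a few lines ahead */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_DIST * DOUBLES_PER_LINE < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST * DOUBLES_PER_LINE]);
        s += a[i];
    }
    return s;
}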
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
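A quick back-of-the-envelope version of bandwidth = concurrency / latency, with illustrative numbers (10 fill buffers, 64-byte lines, 80 ns to DRAM; none of these are measurements of a specific CPU):

#include <stdio.h>

int main(void) {
    double line_bytes = 64.0;
    double outstanding = 10.0;   /* L1 misses in flight (line-fill buffers) */
    double latency_ns = 80.0;    /* load-to-use latency to DRAM (assumed) */

    /* bytes per ns == GB/s */
    printf("~%.1f GB/s single-core demand bandwidth\n",
           outstanding * line_bytes / latency_ns);
    /* ~8.0 GB/s; doubling the line size or halving the latency doubles it */
    return 0;
}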
If your cache-miss handling has an early-restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says it's theoretically possible for a CPU design not to have it.) The load that caused the miss gets its data as soon as it arrives. The rest of the cache-line fill happens "in the background", and hopefully later loads can also be satisfied from the partially received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
The simplistic answer is that larger cache blocks would be preferred since the workload has no (data) temporal locality (no data reuse), perfect spatial locality (excluding the potentially inadequate alignment of the array for the first block and the insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider several factors. The size and alignment of the array: the fraction of the first and last cache blocks that will be unused, and what fraction of the memory transfer time that represents; for a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth. The ability of the system to use critical word first: if the array is of modest size and there is no support for early use of data as it becomes available rather than waiting for the entire block to be filled, then the start-up overhead will remove much of the prefetching advantage of larger cache blocks. The use of prefetching: software or hardware prefetching reduces the benefit of large cache blocks, and this workload is extremely friendly to prefetching. The configuration of the memory system: e.g., using DRAM with an immediate page-close controller policy would increase the benefit of larger cache blocks, because each access would involve a row activate and a row close, often to the same DRAM bank, preventing latency overlap. Whether the same block size is used for instructions and page table accesses, and whether these accesses share the cache: instruction accesses provide a second "stream" which can introduce conflict misses, and with shared caching of a two-level hierarchical page table, TLB misses would access two cache blocks. Whether simple way prediction is used: a larger block would increase prediction accuracy, reducing misprediction overhead. And perhaps other factors.
From your example code we can't say either way, as long as the hardware prefetcher can sustain a memory stream at maximum memory throughput.
In a random-access scenario a shorter cache line might be preferable, as you then don't need to fill the whole line. But the total amount of cached memory would go down, since you need more circuitry for tags and potentially more time for comparing.
So a compromise must be made: Intel has chosen 64 bytes per line (and fetches 2 lines), while others have chosen 32 bytes per line.

What are the semantics of the Super Queue and Line Fill Buffers?

I am asking this question regarding the Haswell microarchitecture (Intel Xeon E5-2640-v3 CPU). From the specifications of the CPU and other resources I found out that there are 10 LFBs and that the size of the super queue is 16. I have two questions related to the LFBs and the Super Queue:
1) What is the maximum degree of memory-level parallelism the system can provide, 10 or 16 (LFBs or SQ)?
2) According to some sources, every L1D miss is recorded in the SQ, which then assigns a Line Fill Buffer; other sources say the SQ and LFBs can work independently. Could you please explain the working of the SQ in brief?
Here is an example figure (not for Haswell) of the SQ and LFB.
References:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
http://www.realworldtech.com/haswell-cpu/
For (1), logically the maximum parallelism would be limited by the least-parallel part of the pipeline, which is the 10 LFBs, and this is probably strictly true for demand-load parallelism when prefetching is disabled or can't help. In practice, everything is more complicated once your load is at least partly helped by prefetching, since then the wider queues between L2 and RAM can be used, which could make the observed parallelism greater than 10. The most practical approach is probably direct measurement: given the measured latency to RAM and the observed throughput, you can calculate an effective parallelism for any particular load.
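A Little's-law style sketch of that measurement, with invented example numbers rather than actual Haswell figures:

#include <stdio.h>

int main(void) {
    double latency_ns = 90.0;      /* measured load-to-use latency to RAM */
    double throughput_gbs = 12.0;  /* observed bandwidth of the workload */
    double line_bytes = 64.0;

    /* lines in flight = (bytes/ns) * (ns per request) / (bytes per line) */
    double parallelism = throughput_gbs * latency_ns / line_bytes;
    printf("effective parallelism ~= %.1f outstanding lines\n", parallelism);
    /* ~16.9 here, i.e. more than 10 LFBs, implying prefetch beyond L1 helped */
    return 0;
}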
For (2) my understanding is that it is the other way around: all demand misses in L1 first allocate into the LFB (unless of course they hit an existing LFB) and may involve the "superqueue" later (or whatever it is called these days) if they also miss higher in the cache hierarchy. The diagram you included seems to confirm that: the only path from the L1 is through the LFB queue.

CPU bound vs Cache bound - Can instructions be executed without cache/memory access? Can memory access be as fast as instruction execution?

I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers that explain that there are other variants like Memory Bound, Cache bound, etc.
I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.
However, the difference between CPU Bound programs and Cache Bound programs doesn't seem as clear. Here is what I gathered:
Cache bound - Speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most visited part of a program is a small chunk of code inside a loop small enough to be contained within the cache, then the program may be cache bound.
CPU bound - The speed at which CPU executes instructions is an important factor in deciding the speed at which the program gets executed.
But how can processes be CPU bound? I mean, instructions need to be fetched (from cache/main memory) before execution every time, so no matter how fast the CPU is, it will have to wait for the cache to finish the data transfer, and thus will at least be Cache Bound or Memory Bound, since memory access is slower than instruction execution.
So is CPU bound the same as cache bound?
CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow, while others will remain empty - both cases are bad utilization, but you need to find the jam to release everything.
Similarly, with a CPU you have multiple systems that need to work in unison to make the program progress. Each of these systems has an upper limit on the bandwidth at which it can work, and when that limit is reached it becomes the bottleneck, leaving the other systems underutilized or even stalled.
Main memory, for example, depends on the number of channels and the type of DRAM (and of course frequency), but let's say it commonly peaks at around 25 GB/s in client CPUs. That means that any workload that tries to consume data beyond this rate will become blocked by the memory BW (i.e. memory bound), and the rest of the systems will be underutilized.
Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.
According to section 2.1.3 here, Intel Skylake for example provides 2x 32B loads + 1 store per cycle from the L1 (though the actual sustained utilization they quote is a little lower, probably due to collisions or writeback interference); L2 is effectively about half a line per cycle and L3 a little less than a third of a line. This means that if your data set is contained in one of these levels, you can reach that peak BW before being capped by that cache.
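To put those per-cycle figures into GB/s (the 4 GHz clock is my assumption for the arithmetic, not a number from the manual):

#include <stdio.h>

int main(void) {
    double ghz = 4.0;                        /* cycles per ns (assumed) */
    double load_bytes_per_cycle = 2 * 32.0;  /* 2 x 32B loads per cycle */
    double store_bytes_per_cycle = 32.0;     /* 1 x 32B store per cycle */

    printf("peak L1d load BW  ~= %.0f GB/s\n", load_bytes_per_cycle * ghz);
    printf("peak L1d store BW ~= %.0f GB/s\n", store_bytes_per_cycle * ghz);
    /* 256 GB/s loads + 128 GB/s stores at 4 GHz, far above DRAM bandwidth */
    return 0;
}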
On the other hand, let's say you don't reach peak cache bandwidth, but instead consume data from the L1 at a lower rate, while each element of data requires many complicated mathematical operations. In that case, you may be bound by your execution bandwidth, more so if these operations are limited to only a subset of the execution ports (as is the case with some esoteric operations).
There are useful tools to determine what you're bound by; look up Top-Down analysis, for example.

Is memory latency affected by CPU frequency? Is it a result of memory power management by the memory controller?

I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution times have on-chip and off-chip components. On-chip components of execution time scale linearly with CPU frequency whereas the off-chip components remain unaffected.
Therefore, for CPU-bound applications, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory bound application where the caches are often missed and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other, you also have to add a constant).
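That affine relationship is easy to see in a toy model of the form T(f) = C_cpu / f + T_mem, where C_cpu is the on-chip work in core cycles and T_mem is the fixed off-chip time. All numbers below are invented for illustration:

#include <stdio.h>

int main(void) {
    double n_instr = 1e9;   /* instructions retired */
    double c_cpu = 2e9;     /* on-chip work, in core cycles */
    double t_mem = 0.5;     /* off-chip (DRAM) time in seconds, frequency-independent */

    for (double f_ghz = 1.0; f_ghz <= 3.0; f_ghz += 1.0) {
        double t = c_cpu / (f_ghz * 1e9) + t_mem;   /* seconds */
        printf("f = %.0f GHz: rate = %.2f instr/ns\n", f_ghz, n_instr / t / 1e9);
    }
    /* 0.40, 0.67, 0.86 instr/ns: the rate rises with f but flattens instead
     * of scaling linearly, because t_mem does not shrink with frequency. */
    return 0;
}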
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I effectively create a linked list whose individual nodes have sizes equal to the size of a cache-line (64 bytes). I allocated a large amount of memory that is a multiple of the cache-line size.
The linked list is circular such that the last element links to the first element. Also, this linked list randomly traverses through the cache-line sized blocks in the allocated memory. Every cache-line sized block in the allocated memory is accessed, and no block is accessed more than once.
Because of the random traversal, I assumed it should not be possible for the hardware to use any prefetching. Basically, by traversing the list, you have a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes. Therefore, the memory accesses should not be parallelizable.
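A minimal sketch of such a list in C (my reconstruction from the description above, not the actual test application; timing and performance-counter sampling are omitted, and n is assumed to be at least 2):

#include <stdlib.h>

/* One node per 64-byte cache line, linked in a random circular order so a
 * prefetcher sees no stride and every load depends on the previous one. */
struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];   /* pad node to one cache line */
};

struct node *build_list(size_t n) {
    struct node *nodes = aligned_alloc(64, n * sizeof(struct node));
    size_t *order = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                    /* close the random cycle */
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    free(order);
    return &nodes[0];                                 /* any node starts the cycle */
}

struct node *chase(struct node *p, size_t steps) {    /* the timed loop */
    while (steps--) p = p->next;
    return p;   /* return the pointer so the chain isn't optimized away */
}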
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The following two plots show results for the CPU-bound case and the memory-bound case respectively. On the x-axis, the CPU clock frequency is given in GHz; on the y-axis, the instruction-retirement rate is given in 1/ns.
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency, passing through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but not as much for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected because the instruction-retirement rate should not increase at the same rate as CPU frequency for a memory-bound application. The markers appear to fall on straight lines, which is also expected.
However, there appear to be step changes in the instruction-retirement rate with changes in CPU frequency.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallow(er) affine one for the memory-bound case (which is less affected by the CPU). You will need a lot more data to determine whether those are consistent steps or whether they are, as I suspect, mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is interestingly 400 MHz, but I wouldn't draw too much from this. Generally, this kind of thing is way too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(please draw nicer lines of best fit)
