I am asking this question regarding the Haswell microarchitecture (Intel Xeon E5-2640 v3 CPU). From the specifications of the CPU and other resources I found out that there are 10 LFBs and the size of the super queue is 16. I have two questions related to LFBs and super queues:
1) What will be the maximum degree of memory-level parallelism the system can provide, 10 or 16 (LFBs or SQ)?
2) According to some sources, every L1D miss is recorded in the SQ, which then assigns a line fill buffer; other sources say that the SQ and LFBs can work independently. Could you please briefly explain how the SQ works?
Here is an example figure (not for Haswell) of the SQ and LFB.
References:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
http://www.realworldtech.com/haswell-cpu/
For (1), logically the maximum parallelism would be limited by the least-parallel part of the pipeline, which is the 10 LFBs, and this is probably strictly true for demand-load parallelism when prefetching is disabled or can't help. In practice, everything is more complicated once your load is at least partly helped by prefetching, since then the wider queues between L2 and RAM can be used, which could make the observed parallelism greater than 10. The most practical approach is probably direct measurement: given measured latency to RAM and observed throughput, you can calculate an effective parallelism for any particular load.
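As a sketch of that last point, Little's law gives effective parallelism as bandwidth x latency / bytes per request. The latency, throughput and line size below are placeholder numbers, not measurements of this CPU; substitute your own measured values:

#include <stdio.h>

/* Effective memory-level parallelism via Little's law:
   concurrency = bandwidth * latency / bytes_per_request.
   All three inputs below are hypothetical. */
int main(void) {
    double latency_ns     = 90.0;  /* measured load-to-use latency to RAM */
    double throughput_gbs = 6.0;   /* observed throughput of the workload */
    double line_bytes     = 64.0;  /* one LFB request = one cache line    */

    /* 1 GB/s == 1 byte/ns, so requests per ns = (bytes/ns) / (bytes/request) */
    double parallelism = (throughput_gbs / line_bytes) * latency_ns;

    printf("effective parallelism ~ %.1f outstanding lines\n", parallelism);
    return 0;
}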
For (2) my understanding is that it is the other way around: all demand misses in L1 first allocate into the LFB (unless of course they hit an existing LFB) and may involve the "superqueue" later (or whatever it is called these days) if they also miss higher in the cache hierarchy. The diagram you included seems to confirm that: the only path from the L1 is through the LFB queue.
Related
If I have two different cache subsystem designs, C1 and C2, that both have roughly the same hardware complexity, can I decide which one is the better choice, given that the effectiveness of the cache subsystem is the prime factor, i.e., the number of misses should be minimized?
Given the total miss rates below:
miss_rate = (number of cache misses)/(number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss rate information sufficient to make decision of what subsystem is better?
Yes, assuming hit latency is the same for both caches, actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize.
All differences in size, associativity, and eviction policy matter because of their impact on miss rate on any given workload. Even cache line (block) size factors into this: a cache with twice as many 32-byte lines vs. a cache with half as many 64-byte lines would be able to cache more scattered words, but pull in less nearby data on a miss. (Unless you have hardware prefetching, but again, prefetch algorithms ultimately just affect miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
Well, not just latency, but overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. Which is what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance based on just instruction mix, hit/miss rate, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others (on the critical path of some larger dependency chain vs. not). Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache-miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, it can usually hide store-miss latency.) So the miss penalty can differ depending on which loads miss, i.e. whether software instruction scheduling was able to hide more vs. less of the latency for a given load. (If the independent work after a load includes other loads, then you'd need a non-blocking cache that handles hit-under-miss. Miss-under-miss, to get memory-level parallelism from multiple misses in flight, also helps, as does being able to get to a hit after 2 or more cache-miss loads.)
I think usually for most workloads with different cache geometries and sizes, there won't be significant bias towards more of the misses being easier to hide or not, so you could still say that miss-rate is the only thing that ultimately matters.
Miss-rate for a cache depends on workload, so you can't say that a lower miss rate on just one workload or trace makes it better on average or for all cases. e.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24 KiB array, the 32 KiB 2-way cache might have a higher hit rate.
The term "better" is subjective as follows:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip. (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific). Thus, two different workloads may have different miss rates on each of the caches in question. So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that lower miss rate is better, we would have to know that all the other factors are equal. Otherwise, the notion of better needs to be defined, perhaps to include cost/benefit definition.
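To make the point about hit/miss timings concrete, here is a small AMAT (average memory access time) comparison. The miss rates are the ones given in the question, but the hit times and miss penalty are invented purely for illustration:

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
   Hypothetical timings: if C2's lower miss rate comes at the cost of a
   slower hit (7 vs. 2 cycles), C1 can still win overall. */
int main(void) {
    double miss_penalty = 100.0;                    /* cycles, assumed     */
    double amat_c1 = 2.0 + 0.77 * miss_penalty;     /* fast, dumber cache  */
    double amat_c2 = 7.0 + 0.73 * miss_penalty;     /* slow, smarter cache */
    printf("AMAT C1 = %.0f cycles, AMAT C2 = %.0f cycles\n", amat_c1, amat_c2);
    return 0;
}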
I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287 cache-references
2140314044614 cache-misses # 6.975 % of all cache refs
52970546 page-faults
1244774326560850 instructions
194784478522820 branches
2573746947618 branch-misses # 1.32% of all branches
52970546 faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound    retiring    bad speculation    backend bound
10.1%             25.9%       4.5%               59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there are two main reasons for this instruction to be slow: 1. there is a dependency chain, and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions; 2. there is a cache miss (saturating memory with such an instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on for a specific instruction is hard (especially if the instruction is not executed many times). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise ones.
Regarding (1), the latency of the instruction should be about 12 cycles on both architectures (although it might be slightly higher on the AMD processor, I do not expect a 44% difference). The target processors are able to execute multiple instructions in the same cycle. Instructions are executed on different ports and are also pipelined. The port usage matters for understanding what is going on, which means all the instructions in the hot loop matter; you cannot isolate this specific instruction. Modern processors are insanely complex, so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6 or 7_8. Only the first three matter for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check whether a port is saturated, and which one. AFAIK, none of these events are precise, so you can only analyse a block of code (typically the hot loop). On Zen 3, the relevant ports are FP23 and FP45. I do not know which events are useful on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events, where XXX is a power-of-two integer (these should be precise events, so you can see whether this instruction is the one triggering them). This helps you see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be, for example, L1_MISS, L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, it may be useful to track the latency of the memory operations with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power-of-two integer.
You can also use LLVM-MCA to statically simulate the scheduling of the loop instructions on the target architecture (it does not consider branches). This is very useful for understanding in depth what the scheduler can do.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platforms, or at least close. That being said, even with the same core count, the two certainly do not operate at the same frequency. If the difference is not coming from that, then I doubt this instruction alone is actually the problem (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is, the cache-misses event is certainly not very informative here. Indeed, it counts last-level cache (L3) misses. Thus, it does not give any information about L1/L2 misses (the events mentioned above do).
how should I proceed to optimize this?
If the code is latency bound, the solution is first to break any dependency chain in this loop. Unrolling the loop and rewriting it to make it more SIMD-friendly can help a lot (the reciprocal throughput of this instruction is about 1 cycle, as opposed to a latency of about 12 cycles, so there is room for improvement here).
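As a sketch of that kind of rewrite (not your actual code; it assumes the converted values feed a reduction), splitting the work into independent streams breaks the dependency chain and gives the compiler a body it can auto-vectorize, e.g. with packed VCVTTPD2QQ on AVX-512:

#include <stddef.h>

/* Hypothetical rewrite of the hot loop: four independent accumulators
   break the single dependency chain, and the simple indexed body is
   easy for the compiler to auto-vectorize. */
long sum_truncated(const double *array, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (long)array[i];       /* each (long) cast is a cvttsd2si  */
        s1 += (long)array[i + 1];   /* (or one packed conversion if the */
        s2 += (long)array[i + 2];   /*  compiler vectorizes the loop)   */
        s3 += (long)array[i + 3];
    }
    for (; i < n; i++)              /* scalar tail */
        s0 += (long)array[i];
    return s0 + s1 + s2 + s3;
}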
If the code is memory bound, then you should care about data locality. Data should fit in the L1 cache if possible. There are many ways to achieve this, but it is hard to give guidance without more context; examples include sorting data, reordering loop iterations, and using smaller data types.
There are many possible sources of weird, unexpected behaviour. If such a thing happens, it is nearly impossible to understand what is going on without the exact code being executed. All details matter in this case.
CPUs are still "improving", but their frequency hasn't improved much over the last 10 years.
I can understand that the transistor count increases because of smaller and smaller transistors, but I don't understand how a non-parallel program (most programs are non-parallel, I think?) can be executed much faster on new CPUs if the frequency doesn't increase.
I can understand why GPUs can be faster with more transistors, because they're parallel processors (is that the right term?) and they only execute parallel code.
But most software is non-parallel, so to me it seems that new CPUs should not be much faster than previous CPUs, unless most programs can be parallelized, which is not the case (I'm not sure, but what are typical algorithms that can't be parallelized?).
Are larger L1/L2/L3 cache sizes allowing new CPUs to be faster? Or are there other things, like new instructions or improvements to branching?
What am I missing?
More and more programs are using threads for things that can be parallelized reasonably. But you're asking about single-threaded (or per-core) performance, and that's totally fine and interesting.
You're missing instruction-level parallelism (ILP) and increasing IPC (instructions per cycle).
Also, SIMD (x86's SSE, AVX, AVX-512, or ARM NEON, SVE) to get more work done per instruction, exploiting data-parallelism in a (potentially small) loop that way instead of or as well as with threading. But that isn't a big factor for many applications.
Work per clock is instructions/cycle x work/insn x threads. (threads is basically number of clocks ticking at once, if your program is running on multiple cores). Even if threads is 1, the other two factors can increase.
A problem with lots of data parallelism (e.g. summing an array, or adding 1 to every element) can expose that parallelism to the CPU in three ways: SIMD, instruction-level parallelism (e.g. unroll with multiple accumulators if there's a dependency chain like a sum), and thread-level parallelism.
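As an illustration of the SIMD part (a sketch assuming x86 AVX and a GCC/Clang-style compiler; the same loop could additionally be unrolled for more ILP or split across threads):

#include <immintrin.h>  /* AVX intrinsics; compile with -mavx */
#include <stddef.h>

/* "Add 1 to every element", with the data parallelism expressed as SIMD:
   each AVX add processes 4 doubles per instruction. */
void add_one(double *a, size_t n) {
    const __m256d one = _mm256_set1_pd(1.0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm256_storeu_pd(a + i, _mm256_add_pd(_mm256_loadu_pd(a + i), one));
    for (; i < n; i++)              /* scalar tail */
        a[i] += 1.0;
}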
These are all orthogonal. And some of them apply to problems that aren't data-parallel, just to different steps of a complicated program. IPC applies all the time. With good enough branch prediction, CPUs can see far ahead in the instruction stream and find parallel work to do (especially memory-level parallelism), as long as the code isn't doing something like traversing a linked list, where the next load address depends on the current load result. Then you bottleneck on load latency, with no memory-level parallelism (except for whatever work you're doing on each node).
Some major factors
Larger caches improve hit rates and effective bandwidth, leading to fewer stalls. That raises average IPC. (Also smarter cache-replacement algorithms, like L3 adaptive replacement in Ivy Bridge.)
Actual DRAM bandwidth increases help, too (especially with good HW prefetching), but DRAM B/W is shared between cores. L1/L2 caches are private in modern CPUs, and L3 bandwidth scales nicely as well, with different cores accessing different parts of it. Still, DRAM often comes into play, especially in code that isn't carefully tuned for cache-blocking. DRAM latency is near constant (in absolute nanoseconds, so getting "worse" in core clock cycles), but memory clocks have been climbing significantly in the past decade.
Larger ReOrder Buffers (ROB) and schedulers (RS) allow CPUs to find ILP over larger windows. Similarly, larger load and store buffers allow more memory-level parallelism, e.g. tracking more in-flight cache-miss loads in parallel. And having a larger buffer before you have to stall if a store misses in cache.
Better branch prediction reduces how often speculative work has to be discarded when the CPU finds it had guessed the wrong path for an earlier branch.
Wider pipelines allow higher peak IPC. At best, in high-throughput code (not a lot of stalls, and lots of ILP), this can be sustained.
Otherwise, it at least helps the CPU get to the next stall sooner, doing a burst of work. And it helps clear out instructions waiting in the ROB when a cache-miss load does finally arrive, making room for newly issued instructions that may contain independent work. If execution of a loop condition can get far ahead of the actual work in the loop, a mispredict of the loop-exit branch might be resolved before the back-end runs out of work to do. So a max IPC higher than the steady-state bottleneck of a loop is useful for loops that aren't infinite.
See also
Modern Microprocessors: A 90-Minute Guide! - covers this quite well, including how clocks mostly stopped increasing steadily once we hit the "power wall" with Pentium 4. As ever more efficient CPUs are designed, clocks are creeping up again, especially with fine-grained clock gating to stop heat generation from parts of a CPU that aren't doing anything in any given clock cycle.
This allows high turbo clocks when running "inefficient" code that bottlenecks on cache misses, branch misses, and other things like that so there aren't a lot of execution units busy at once.
How does a single thread run on multiple cores? - it doesn't, that's not what IPC is about. My answer there attempts to explain it so a beginner can understand.
https://www.realworldtech.com/sandy-bridge/ A deep dive on Sandybridge, how it finds instruction-level parallelism.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (much more technical) - what to look for when predicting how fast a CPU might run a given loop or sequence of assembly instructions.
Given a cache with constant capacity and associativity, for code that computes the average of the elements of an array, would a cache with a larger block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for (j = 0; j < k; j++) {
    sub_total = 0;  /* Nested loops to avoid overflow */
    for (i = 0; i < N; i++) {
        sub_total += A[j*N + i];
    }
    total += sub_total/N;
}
average = total/k;
Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (PowerPoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size showing a curve, and also break it down into miss penalty vs. miss rate (for a simple model, I think, of a simple in-order CPU that is bad at hiding memory latency, e.g. maybe not even pipelining multiple independent misses (miss under miss)).
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order), and even cache-blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not modern out-of-order CPUs that can overlap some computation with cache-miss latency, so having more, shorter misses is different from having fewer, longer misses.
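For reference, loop interchange in its simplest form looks like this (a generic sketch, not taken from the slides); both functions compute the same column sums over a row-major array:

#include <stddef.h>

/* Bad: the inner loop strides by N doubles, touching a new cache line
   on almost every access. Good: the inner loop walks contiguous memory,
   so every byte of each fetched line gets used. */
void col_sums_bad(const double *A, double *col_sum, size_t N) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            col_sum[j] += A[i * N + j];
}

void col_sums_good(const double *A, double *col_sum, size_t N) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            col_sum[j] += A[i * N + j];
}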
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
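For reference, software prefetch in a loop like this usually looks something like the sketch below; it assumes the GCC/Clang __builtin_prefetch builtin, and the prefetch distance of 64 elements is just a guess you would tune:

#include <stddef.h>

/* One prefetch per 64-byte line (8 doubles); with 128-byte lines it
   would be one per 16 doubles, i.e. half the prefetch overhead. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 7) == 0)
            __builtin_prefetch(&a[i + 64], 0 /* read */);
        s += a[i];
    }
    return s;
}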
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
If your cache-miss handling has an early restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says it is theoretically possible for a CPU design not to have it.) The load that caused the miss gets its data as soon as it arrives. The rest of the cache-line fill happens "in the background", and hopefully later loads can also be satisfied from the partially received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
The simplistic answer is that larger cache blocks would be preferred since the workload has no (data) temporal locality (no data reuse), perfect spatial locality (excluding the potentially inadequate alignment of the array for the first block and insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider several factors:
The size and alignment of the array: the fraction of the first and last cache blocks that will be unused, and what fraction of the memory transfer time that represents. For a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth.
The ability of the system to use critical word first: if the array is of modest size and there is no support for early use of data as it becomes available (rather than waiting for the entire block to be filled), then the start-up overhead will remove much of the prefetching advantage of larger cache blocks.
The use of prefetching: software or hardware prefetching reduces the benefit of large cache blocks, and this workload is extremely friendly to prefetching.
The configuration of the memory system: e.g., using DRAM with an immediate page-close controller policy would increase the benefit of larger cache blocks, because each access would involve a row activate and row close, often to the same DRAM bank, preventing latency overlap.
Whether the same block size is used for instructions and page table accesses, and whether these accesses share the cache: instruction accesses provide a second "stream", which can introduce conflict misses; with shared caching of a two-level hierarchical page table, TLB misses would access two cache blocks.
Whether simple way prediction is used: a larger block would increase prediction accuracy, reducing misprediction overhead.
And perhaps other factors.
From your example code we can't say either way, as long as the hardware prefetcher can keep up a memory stream at maximum memory throughput.
In a random-access scenario a shorter cache line might be preferable, as you then don't need to fill the whole line. But the total amount of cached memory would go down, since you would need more circuitry for tags and potentially more time for comparing.
So a compromise must be made: Intel has chosen 64 bytes per line (and fetches 2 lines), while others have chosen 32 bytes per line.
I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers that explain that there are other variants like Memory Bound, Cache bound, etc.
I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.
However, the difference between CPU bound programs and cache bound programs doesn't seem as clear. Here is what I gathered:
Cache bound - Speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most visited part of a program is a small chunk of code inside a loop small enough to be contained within the cache, then the program may be cache bound.
CPU bound - The speed at which CPU executes instructions is an important factor in deciding the speed at which the program gets executed.
But how can processes be CPU bound? I mean, instructions need to be fetched before execution (from cache/ Main Memory) every time, so, no matter how fast the CPU is, it will have to wait for the cache to finish data transfer and thus will at least be Cache Bound or Memory bound, since memory access is slower than instruction execution.
So is CPU bound the same as cache bound?
CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow, while others will remain empty - both cases are bad utilization, but you need to find the jam to release everything.
Similarly, with a CPU you have multiple systems that need to work in unison to make the program progress. Each of these systems has an upper limit on the bandwidth at which it can work, and when that limit is reached, it becomes a limitation, making the other systems underutilized or even stalled.
Main memory, for example, depends on the number of channels and the type of DRAM (and of course its frequency), but let's say it commonly peaks at about 25 GB/s in client CPUs. That means any workload that tries to consume data beyond this rate will become blocked by the memory bandwidth (i.e. memory bound), and the rest of the systems will be underutilized.
Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.
According to section 2.1.3 here, Intel Skylake for example provides two 32-byte loads + one store per cycle from the L1 (though the actual utilization they quote is a little lower, probably due to collisions or write-back interference); L2 is effectively about 1/2 line per cycle, and L3 a little less than 1/3. This means that if your data set is contained in one of these levels, you can reach that peak BW before being capped by that cache.
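As a quick back-of-the-envelope check of those L1 numbers (the 4 GHz clock and the 32-byte store width are assumptions, not measured values):

#include <stdio.h>

/* Peak L1D bandwidth from "2 x 32-byte loads + 1 x 32-byte store per
   cycle": bytes per cycle times cycles per nanosecond gives GB/s. */
int main(void) {
    double ghz      = 4.0;            /* assumed core clock          */
    double load_bw  = 2 * 32 * ghz;   /* ~256 GB/s of loads from L1D */
    double store_bw = 1 * 32 * ghz;   /* ~128 GB/s of stores to L1D  */
    printf("peak L1D per core: %.0f GB/s load + %.0f GB/s store\n",
           load_bw, store_bw);
    return 0;
}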
On the other hand, let's say you don't reach the peak cache bandwidth, instead consuming data from the L1 at a lower rate, but each element of data requires many complicated mathematical operations. In that case, you may be bounded by your execution bandwidth - more so if these operations are limited to only part of the execution ports (as is the case with some esoteric operations).
There are useful tools to determine what you're bounded by; look up TopDown analysis, for example.