How and where are instructions cached?

If I were to run a binary executable program, do all of the instructions have to be stored in the instruction cache? Is the instruction cache a location in one of the L1, L2, or L3 CPU caches, or is it a different entity? I am not very clear on what physically happens when I run an executable, and any sort of clarity would be greatly appreciated.

The instructions are stored permanently in the binary executable file on your mass storage device. When the program is executed, the contents of the executable file are loaded into primary memory (RAM). Generally speaking, all traffic between the CPU and RAM goes through a cache, and so do the instructions contained in the executable.
A cache may contain instructions, data, or both. In the case of Intel CPUs, each core has two L1 caches: L1d for data and L1i for instructions; L2 and L3 are unified and hold both.
That is the short answer. If you wish to learn more, there is plenty of good material around. Ulrich Drepper's What every programmer should know about memory is a great read for one.
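If you want to see the split L1 and unified outer levels on your own machine, glibc on Linux exposes the cache geometry through sysconf. A minimal sketch, assuming Linux with glibc (the _SC_LEVEL* names are a glibc extension, not POSIX, and may report 0 on some systems):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* glibc extension: cache sizes as reported by the kernel/CPUID */
        printf("L1i: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
        printf("L1d: %ld bytes (line: %ld bytes)\n",
               sysconf(_SC_LEVEL1_DCACHE_SIZE),
               sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }

The same information is also available under /sys/devices/system/cpu/cpu0/cache/ if you prefer to read it from a shell.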

Your executable on disk is logically memory-mapped into the virtual address space of a new process.
As usual with virtual memory, touching a 4k page of the executable may trigger a page fault: a hard page fault if the data isn't actually in DRAM, or a soft page fault if it is but the hardware page tables haven't yet been "wired up" for that page. Soft page faults can still happen on repeated runs because OSes are typically "lazy" about wiring up page tables, even when the file data is already in the kernel's pagecache.
(x86 uses 4kiB virtual memory pages. That's a common page size, but some other ISAs use, or can be configured to use, other page sizes like 8k or 16k.)
The kernel may optimize by making sure at least the page containing the process entry point is loaded from disk and wired up before entering user-space. Otherwise it will just page-fault right away back to the kernel. e.g. very early Linux, like 0.11, worked this way, as a get-it-working step in development before making it good.
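You can watch this demand paging happen from user space: map a file, touch its pages, and look at the fault counters. A rough Linux-only sketch (error handling omitted; /bin/ls is just a stand-in for any executable or file):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/bin/ls", O_RDONLY);   /* any file will do */
        struct stat st;
        fstat(fd, &st);
        const unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0);

        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);
        unsigned long sum = 0;
        for (off_t off = 0; off < st.st_size; off += 4096)
            sum += p[off];                    /* first touch of each 4k page */
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults: %ld, major faults: %ld (checksum %lu)\n",
               after.ru_minflt - before.ru_minflt,
               after.ru_majflt - before.ru_majflt, sum);
        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }

If the file is already in the pagecache you should see only soft (minor) faults; with a cold file, some of them become hard (major) faults.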
When data is in DRAM, execution works by the CPU doing code-fetch through L1i / L2 / L3 cache. This is exactly like data loads, but goes through L1 I-cache instead of L1 D-cache, assuming a machine with split L1 caches. Outer levels of cache (assuming they exist) are almost always unified. Modern x86 CPUs and other high-end chips like POWER do typically have 3 levels of cache, usually L1i/d and L2 are per-core private, with a large shared L3. Otherwise you might have just private L1i/L1d and shared L2, or in a single-core system maybe just 1 or 2 levels of cache.
These caches have a line size of 64 bytes on most CPUs (including all x86 CPUs since P4 / Core 2 or so). Cache misses don't fault, the core just has to wait for the line to arrive. If there's any out-of-order execution still not done, it can still be happening while a code-cache miss is in flight. But otherwise the CPU is stuck with nothing at all to do.
(TLB misses are also a thing. Most ISAs have hardware page-walk that makes it transparent to software, but an iTLB miss can similarly stall instruction fetch leaving the CPU without any queued up work to do. So i-cache / i-TLB misses are even worse than data-cache / d-TLB misses where miss-under-miss / hit-under-miss, and out-of-order execution, can let some useful work continue.)
In modern x86 CPUs, decoded instructions are cached in a small, very fast "uop cache", as well as caching the machine-code bytes in L1i cache.
Pentium 4 didn't have an L1i cache, only a trace cache, but that didn't work out very well. And it didn't have the transistor or power budget for enough decoders to build traces fast on trace-cache misses. This was one of several major downsides of the NetBurst microarchitecture.

Related

Is there a performance gain on x86 Windows/Linux apps in minimizing page count?

I'm writing and running software on Windows and Linux of the last five years, say, on machines of the last five years.
I believe the running processes will never get close to the amount of memory we have (we're typically using like 3GB out of 32GB) but I'm certain we'll be filling up the cache to overflowing.
I have a data structure modification which makes the data structure run a bit slower (that is, do quite a bit more pointer math) but reduces both the cache line count and page count (working set) of the application.
My understanding is that if processes' memory is being paged out to disk, reducing the page count and thus disk I/O is a huge win in terms of wall clock time and latency (e.g. to respond to network events, user events, timers, etc.) However with memory so cheap nowadays, I don't think processes using my data structure will ever page out of memory. So is there still an advantage in improving this? For example if I'm using 10,000 cache lines, does it matter whether they're spread over 157 4k pages (=643k) or 10,000 4k pages (=40MB)? Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)
In contrast, I can also reduce cache lines, and I suspect this will probably improve wall clock time quite a bit. Just to be clear on what I'm counting, I'm counting the number of 64-byte-aligned areas of memory in which I'm touching at least one byte as the total cache lines used by this structure. Example: if hypothetically I have 40,000 16-byte structures, a computer whose processes are constantly referring to more memory than it has L1 cache will probably run everything faster if those structures are packed into 10,000 64-byte cache lines instead of each straddling two cache lines apiece for 80,000 cache lines?
The general recommendation is to minimize the number of cache lines and virtual pages that exhibit high temporal locality on the same core during the same time interval of the lifetime of a given application.
When the total number of needed physical pages is about to exceed the capacity of main memory, the OS will attempt to free up some physical memory by paging out some of the resident pages to secondary storage. This situation may result in major page faults, which can significantly impact performance. Even if you know with certainty that the system will not reach that point, there are other performance issues that may occur when unnecessarily using more virtual pages (i.e., there is wasted space in the used pages).
On Intel and AMD processors, page table entries and other paging structures are cached in hardware caches in the memory management unit to efficiently translate virtual addresses to physical addresses. These include the L1 and L2 TLBs. On an L2 TLB miss, a hardware component called the page walker is engaged to perform the required address translation. More pages means more misses. On pre-Broadwell(1) and pre-Zen microarchitectures, there can only be one outstanding page walk at any time. On later microarchitectures, there can be two. In addition, on Intel Ivy Bridge and later, it may be more difficult for the TLB prefetcher to keep up with the misses.
On Intel and AMD processors, the L1D and L2 caches are designed so that all cache lines within the same 4K page are guaranteed to be mapped to different cache sets. So if all of the cache lines of a page are used, in contrast to, for example, spreading the cache lines out over 10 different pages, conflict misses in these cache levels may be reduced. That said, on all AMD processors and on Intel pre-Haswell microarchitectures, bank conflicts are more likely to occur when accesses are more spread out across cache sets.
On Intel and AMD processors, hardware data prefetchers don't work across 4K boundaries. An access pattern that could be detected by one or more prefetchers but has the accesses spread out across many pages would benefit less from hardware prefetching.
On Windows/Intel, the accessed bits of the page table entries of all present pages are reset every second by the working set manager. So when accesses are unnecessarily spread out in the virtual address space, the number of page walks that require microcode assists (to set the accessed bit) per memory access may become larger.
The same applies to minor page faults as well (on Intel and AMD). Both Windows and Linux use demand paging, i.e., an allocated virtual page is only mapped to a physical page on demand. When a virtual page that is not yet mapped to a physical page is accessed, a minor page fault occurs. Just like with the accessed bit, the number of minor page faults per access may be larger.
On Intel processors, with more pages accessed, it becomes more likely for nearby accesses on the same logical core to 4K-alias. For more information on 4K aliasing, see: L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes.
There are probably other potential issues.
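To make the packing arithmetic from the question concrete, here is a minimal sketch (the 16-byte record and the count of 40,000 are the hypothetical numbers from the question; with the array 64-byte aligned, four records share each line and none straddles a boundary):

    #include <stdalign.h>
    #include <stdio.h>

    struct item { double key; double value; };   /* hypothetical 16-byte record */

    #define N 40000
    static alignas(64) struct item items[N];     /* densely packed, 64-byte aligned */

    int main(void) {
        size_t bytes = sizeof items;             /* 40000 * 16 = 640000 bytes */
        size_t lines = (bytes + 63) / 64;        /* 10000 cache lines */
        size_t pages = (bytes + 4095) / 4096;    /* 157 4k pages */
        printf("%zu bytes -> %zu lines, %zu pages\n", bytes, lines, pages);
        return 0;
    }

Spreading those same 10,000 lines across 10,000 separate pages keeps the line count the same but multiplies the page count, and every page-related cost above, by roughly 64.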
Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)
On both Windows and Linux, there is a maximum working set size limit on each process. On Linux, this is called RLIMIT_RSS and is not enforced. On Windows, this limit can be either soft (the default) or hard. The limit is only enforced if it is hard (which can be specified by calling the SetProcessWorkingSetSizeEx function and passing the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag). When a process reaches its hard working set limit, further memory allocation requests will be satisfied by paging out some of its pages to the page file, even if free physical memory is available.
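As a rough illustration of making the limit hard for the current process (a sketch only, with arbitrary example sizes and no real error handling):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Turn the working-set minimum into a hard limit; sizes are in bytes
           and the 1 MB / 32 MB values here are arbitrary examples. */
        if (!SetProcessWorkingSetSizeEx(GetCurrentProcess(),
                                        1 * 1024 * 1024,
                                        32 * 1024 * 1024,
                                        QUOTA_LIMITS_HARDWS_MIN_ENABLE |
                                        QUOTA_LIMITS_HARDWS_MAX_DISABLE)) {
            printf("SetProcessWorkingSetSizeEx failed: %lu\n", GetLastError());
        }
        return 0;
    }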
I don't know about Linux of the 90s.
Footnotes:
(1) The Intel optimization manual mentions in Section 2.2.3 that Skylake can do two page walks in parallel compared to one in Haswell and earlier microarchitectures. To my knowledge, the manual does not clearly mention whether Broadwell can do one or two page walks in parallel. However, according to slide 10 of these Intel slides (entitled "Intel Xeon Processor D: The First Xeon processor optimized for dense solutions"), Xeon Broadwell supports two page walks. I think this also applies to all Broadwell models.

Is cache prefetching done in hardware address space or virtual address space?

Does the hardware prefetcher operate on contiguous virtual addresses, or is it operating on contiguous hardware addresses? Imagine the case where you have a large array of bytes which spans multiple pages. In the virtual address space the bytes are contiguous, but in fact they could be backed by disjoint physical pages. I would hope that the prefetcher is able to do the appropriate conversion using the TLB before it starts to bring in cache lines that belong to the next page.
Is this so?
I couldn't find information that confirmed this and was hoping someone could give more insight.
I'm asking about x86 mainly, but any insight would be appreciated.
I can't answer this for AMD processors, but I can answer it for Intel ones.
As far as I know, the hardware prefetcher(s) should not prefetch cache lines across page boundaries on current Intel processors.
From Intel's Intel® 64 and IA-32 Architectures Optimization Reference Manual, section 7.5.2, Hardware Prefetch:
Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream. Characteristics of the hardware prefetcher are:
[...]
It will not prefetch across a 4-KByte page boundary. A program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from the new page.
The above paragraph is talking about the "unified last-level cache", but things aren't better in L1d land:
2.3.5.4, Data Prefetching
Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
[...]
The prefetched data is within the same 4K byte page as the load instruction that triggered it.
Or in L2:
The following two hardware prefetchers fetched data from memory to the L2 cache and last level cache:
Spatial Prefetcher: [...]
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
However, the processor might prefetch paging data. From Intel's Intel® 64 and IA-32 Architectures Software Developer Manuals, Volume 3A, 4.10.2.3, Details of TLB Use:
The processor may cache translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
Volume 3A, 4.10.3.1, Caches for Paging Structures:
The processor may create entries in paging-structure caches for translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
I know you asked about hardware prefetching, but you should be able to use software prefetching for data (not instructions):
In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
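So if you know your access pattern is about to cross into the next page, you can issue the hint yourself. A minimal sketch using the SSE prefetch intrinsic (the one-page lookahead distance and the T0 hint are arbitrary illustrative choices, not tuned values):

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    /* Sum one byte per cache line of a big buffer, prefetching one page
       ahead so the next page is already on its way when the hardware
       prefetcher would otherwise have stopped at the 4K boundary. */
    unsigned long sum_lines(const unsigned char *buf, size_t len) {
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i += 64) {
            if (i + 4096 < len)
                _mm_prefetch((const char *)buf + i + 4096, _MM_HINT_T0);
            sum += buf[i];
        }
        return sum;
    }

    int main(void) {
        size_t len = 1 << 24;                      /* 16 MiB test buffer */
        unsigned char *buf = calloc(len, 1);
        printf("%lu\n", sum_lines(buf, len));
        free(buf);
        return 0;
    }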

Is the TLB shared between multiple cores?

I've heard that the TLB is maintained by the MMU, not the CPU cache.
Then does one TLB exist on the CPU and get shared between all processors, or does each processor have its own TLB cache?
Could anyone please explain the relationship between the MMU and the L1/L2 caches?
The TLB caches the translations listed in the page table. Each CPU core can be running in a different context, with different page tables. This is what you'd call the MMU, if it was a separate "unit", so each core has its own MMU. Any shared caches are always physically-indexed / physically tagged, so they cache based on post-MMU physical address.
The TLB is a cache (of PTEs), so technically it's just an implementation detail that could vary by microarchitecture (between different implementations of the x86 architecture).
In practice, all that really varies is the size. 2-level TLBs are common now, to keep full TLB misses to a minimum but still be fast enough to allow 3 translations per clock cycle.
It's much faster to just re-walk the page tables (which can be hot in local L1 data or L2 cache) to rebuild a TLB entry than to try to share TLB entries across cores. This is what sets the lower bound on what extremes are worth going to in avoiding TLB misses, unlike with data caches which are the last line of defence before you have to go off-core to shared L3 cache, or off-chip to DRAM on an L3 miss.
For example, Skylake added a 2nd page-walk unit (to each core). Good page-walking is essential for workloads where cores can't usefully share TLB entries (threads from different processes, or not touching many shared virtual pages).
A shared TLB would mean that invlpg to invalidate cached translations when you do change a page table would always have to go off-core. (Although in practice an OS needs to make sure other cores running other threads of a multi-threaded process have their private TLB entries "shot down" during something like munmap, using software methods for inter-core communication like an IPI (inter-processor interrupt).)
But with private TLBs, a context switch to a new process can just set a new CR3 (top-level page-directory pointer) and invalidate this core's whole TLB without having to bother other cores or track anything globally.
There is a PCID (process context ID) feature that lets TLB entries be tagged with one of 16 or so IDs, so entries from different processes' page tables can be hot in the TLB instead of needing to be flushed on context switch. For a shared TLB you'd need to beef this up.
Another complication is that TLB entries need to track the "dirty" and "accessed" bits in the PTE, so they're not purely a read-only cache of PTEs.
For an example of how the pieces fit together in a real CPU, see David Kanter's writeup of Intel's Sandybridge design. Note that the diagrams are for a single SnB core. The only shared-between-cores cache in most CPUs is the last-level data cache.
Intel's SnB-family designs all use a 2MiB-per-core modular L3 cache on a ring bus. So adding more cores adds more L3 to the total pool, as well as adding new cores (each with their own L2/L1D/L1I/uop-cache, and two-level TLB.)
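As a practical aside for software: the main lever an application has for reducing TLB misses and page-walk pressure on its own core is bigger pages. A rough Linux-only sketch asking for transparent huge pages (assumes a kernel with THP support; the 64 MiB size is arbitrary):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL * 1024 * 1024;            /* 64 MiB anonymous mapping */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        /* Hint: back this range with 2 MiB pages where possible.  Each huge
           page covers as much address space as 512 ordinary TLB entries. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");       /* e.g. THP not configured */

        /* ... touch and use the memory ... */
        munmap(p, len);
        return 0;
    }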

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this but I've searched and always found different answer. My question is:
Which is faster? Cache or CPU Registers?
As I understand it, the registers are what the CPU directly loads data into in order to execute it, while the cache is just a storage place close to, or inside, the CPU.
Here are the sources I found that confuse me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
A CPU register is always faster than the L1 cache; it is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower, because if you make something bigger then inevitably the distances are going to get larger. Something that was automatic for a while: shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Write-back caches have been prominent in desktop chip design since about 2001, but were widely missing from the ARM-based mobile chip markets until much more recently.
x86 chips these days have store forwarding from the write-back cache. If you store to an address and then read back the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle. But in fact, the L1 isn't referenced at all in that case. Store forwarding also has some other rules for it to work properly, which also vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
The store forwarding feature can generate false positives wherein the CPU thinks the address is in the write-back buffer based on a fast partial-bits check (usually 10-14 bits, depending on the chip). It uses an extra cycle to verify with a full check. If that fails then the CPU must re-route the request as a regular memory access. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series.
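If you want to measure the register-vs-L1 gap on your own machine, the standard trick is a dependent pointer chase: each load's address comes from the previous load, so the latencies cannot overlap. A rough sketch (the 8 kB array sits comfortably in a 32 kB L1d; convert ns/load to cycles using your clock speed, and treat the numbers as ballpark only):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     1024              /* 1024 * 8 bytes = 8 kB, fits in L1d */
    #define ITERS 100000000UL

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++)
            next[i] = (i + 1) % N;  /* simple cycle through the array */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t idx = 0;
        for (unsigned long i = 0; i < ITERS; i++)
            idx = next[idx];        /* each load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per dependent L1 load (idx=%zu)\n", ns / ITERS, idx);
        free(next);
        return 0;
    }

The same chase over a buffer much larger than L2 gives you L3 and DRAM latencies for comparison (use a randomly shuffled cycle in that case so the hardware prefetchers can't hide the misses).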
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
And then manually cross-reference that with the tables on architectures, models, and release dates from the various "List of CPUs" pages on Wikipedia.

How are cache memories shared in multicore Intel CPUs?

I have a few questions regarding Cache memories used in Multicore CPUs or Multiprocessor systems. (Although not directly related to programming, it has many repercussions while one writes software for multicore processors/multiprocessors systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses, in the scenario where one processor's cache does not have some data but a second processor's cache does have it, thus avoiding a read from memory into the first processor's cache. Is this assumption valid and true?
Will there be any problems in allowing any processor to access another processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a unified L2 shared between all cores. Or on 65nm Core2Quad (which was two Core 2 Duo dies in one package), each pair of cores had their own last-level cache and couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
Last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically 1.5 to 2.25 MB of L3 cache with every core, so a many-core Xeon might have a 36MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to shared L3, and DDR3 / DMI(chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)
Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses, in the scenario where one processor's cache does not have some data but a second processor's cache does have it, thus avoiding a read from memory into the first processor's cache. Is this assumption valid and true?
No. Each CPU core's L1 caches are tightly integrated into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be both fast enough for very hot data and big enough for less-frequently used data that's still accessed regularly. See Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access another processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
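The cost of that coherency traffic is easy to provoke from ordinary code. In the sketch below (assumptions: 64-byte lines, POSIX threads, compile with -pthread), two threads increment counters that share a cache line, then counters padded onto separate lines; the first version forces the line to bounce between cores in Modified state, the second doesn't:

    #include <pthread.h>
    #include <stdalign.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL

    /* Two counters in the same 64-byte line (false sharing)... */
    static alignas(64) struct { volatile unsigned long a, b; } same_line;

    /* ...versus two counters at least a full line apart. */
    static alignas(64) struct { volatile unsigned long a;
                                char pad[64];
                                volatile unsigned long b; } padded;

    static void *bump(void *arg) {
        volatile unsigned long *ctr = arg;
        for (unsigned long i = 0; i < ITERS; i++)
            (*ctr)++;               /* each increment needs the line in Modified state */
        return NULL;
    }

    static void run(volatile unsigned long *x, volatile unsigned long *y,
                    const char *label) {
        pthread_t ta, tb;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, bump, (void *)x);
        pthread_create(&tb, NULL, bump, (void *)y);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%-15s %.2f s\n", label,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }

    int main(void) {
        run(&same_line.a, &same_line.b, "same line");
        run(&padded.a,    &padded.b,    "separate lines");
        return 0;
    }

On typical multi-core x86 hardware the "same line" run is usually noticeably slower, even though the two threads never touch the same variable.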
Quick answers
1) Yes. 2) No, but it all may depend on what memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.
For a full-length explanation of the issue you should read the 9-part article "What every programmer should know about memory" by Ulrich Drepper ( http://lwn.net/Articles/250967/ ); you will get the full picture of the issues you seem to be inquiring about in good and accessible detail.
To answer your first, I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache, and they share a second-level cache. This helps with both data synchronization and utilization of memory.
To answer your second question, I believe your assumption to be correct. If the processors were to be able to access each others' cache, there would obviously be less cache misses as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question, there could potentially be a problem with accessing other processors' cache memory, which goes to the "Single Write Multiple Read" principle. We can't allow more than one process to write to the same location in memory at the same time.
For more info on the core 2 duo, read this neat article.
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/

Resources