How are cache blocks fetched from RAM into the CPU?

I'm learning more about the theoretical side of CPUs, and I read about how a cache can be used to fetch a line/block of memory from RAM into an area closer to the CPU that can be accessed more quickly (I think it takes fewer clock cycles because the CPU doesn't need to move the entire address of the next word into a register, and also because the cache is physically closer to the CPU).
But now I'm not clear on the implementation exactly. The CPU is connected to RAM through a data bus that could be 32 or 64 bits wide in modern machines. But the L3 cache can in some cases be as large as 32 MB, and I am pretty convinced there aren't millions of data lines going from RAM to the CPU's cache. Even the tiny-in-comparison L1 cache of only a few KB would take hundreds or even thousands of clock cycles to fill from RAM through such a narrow data bus.
So what I'm trying to understand is: how exactly is the CPU cache implemented to transfer so much information while still being efficient? Are there any examples of relatively simple CPUs from the last few decades that I can look at to learn how they implemented that part of the architecture?

As it turns out, there actually is a very wide bus to move info between levels of cache. Thanks to Peter for pointing it out to me in the comments and providing useful links for further reading.

Since you want to understand the implementation of the CPU cache and RAM (main memory), here is a helpful simulation where you can set the sizes of RAM and cache and see how they work together:
https://www3.ntu.edu.sg/home/smitha/ParaCache/Paracache/dmc.html
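To make the lookup part of the implementation concrete, here is a minimal C sketch (not modelled on any particular CPU; the 32 KiB capacity and 64-byte line size are just assumed example parameters) of the address arithmetic a direct-mapped cache performs on every access:

#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: a 32 KiB direct-mapped cache with 64-byte lines. */
#define LINE_SIZE  64u                       /* bytes per cache line         */
#define CACHE_SIZE (32u * 1024u)             /* total capacity in bytes      */
#define NUM_SETS   (CACHE_SIZE / LINE_SIZE)  /* 512 sets, one line per set   */

int main(void)
{
    uint64_t addr = 0x7ffd12345678u;         /* some example byte address    */

    uint64_t offset = addr % LINE_SIZE;              /* byte within the line */
    uint64_t index  = (addr / LINE_SIZE) % NUM_SETS; /* which set to look in */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line  */

    /* The hardware compares 'tag' with the tag stored in set 'index' (plus a
       valid bit). On a match the access is a hit and 'offset' picks the bytes
       out of the 64-byte line; on a miss the whole line is fetched from the
       next level (L2, L3, or RAM) and the stored tag is replaced. */
    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}

Real L1 caches are usually set-associative (several lines per set, each with its own stored tag compared in parallel), but the tag/index/offset split is the same idea, and the linked simulation lets you play with those parameters.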

Related

How long does it take to fill a cache line?

Assuming a cache line is 64 bytes,
100 nanoseconds is the often-quoted figure for main memory access. Is that figure for 1 byte at a time, or for 64 bytes at a time?
It's for a whole cache line, of course.
The busses / data-paths along the way are at least 8 bytes wide at every point, with the external DDR bus being the narrowest. (Possibly also the interconnect between sockets on a multi-core system.)
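As a rough back-of-the-envelope check (assuming, for example, a single DDR4-2400 channel that is 64 bits / 8 bytes wide, so a 64-byte line is one burst of 8 transfers), the bytes themselves occupy the bus for only a few nanoseconds; nearly all of the ~100 ns is latency, not transfer time:

#include <stdio.h>

int main(void)
{
    /* Assumed example parameters: one DDR4-2400 channel, 64-bit data bus. */
    const double transfers_per_sec  = 2400e6;  /* 2400 mega-transfers/s    */
    const double bytes_per_transfer = 8.0;     /* 8 bytes per transfer     */
    const double line_bytes         = 64.0;

    double burst_ns = line_bytes / bytes_per_transfer / transfers_per_sec * 1e9;
    printf("bus occupancy for one 64-byte line: %.2f ns\n", burst_ns); /* ~3.3 ns */
    return 0;
}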
The "critical word" of the cache line might arrive a cycle or two before the rest of it on some CPUs, maybe even 8 on an ancient Pentium-M, but on many recent CPUs the last step between L2 and L1d is a full 64 bytes wide. To make best use of that link (for data going either direction), I assume the L2 superqueue waits to receive a full cache line from the 32-byte ring bus on Intel CPUs, for example.
Skylake for example has 12 Line Fill Buffers, so L1d cache can track cache misses on up to 12 lines in flight at the same time, loads+stores. And the L2 Superqueue has a few more entries than that, so it can track some additional requests created by hardware prefetching. Memory-level parallelism (as well as prefetching) is very important in mitigating the high latency of cache misses, especially demand loads that miss in L3 and have to go all the way to DRAM.
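A quick way to see memory-level parallelism matter (a sketch, not a rigorous benchmark; the array size, the PRNG, and clock() timing are all just illustrative choices): chase a chain of dependent pointers, so only one miss can be outstanding at a time, then sum the same array with independent loads that the out-of-order core and prefetchers can overlap. The dependent version is typically several times slower.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)        /* 16M elements (~128 MB), far larger than any L3 */

static uint64_t rng_state = 88172645463325252ull;
static uint64_t xorshift64(void)        /* Marsaglia xorshift, good enough here */
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: build one big random cycle, so each load in the
       chase below depends on the previous one and misses cannot overlap. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64() % i);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];      /* dependent loads: latency-bound   */
    clock_t t1 = clock();

    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];   /* independent loads: bandwidth-bound */
    clock_t t2 = clock();

    printf("chase %.2fs, sum %.2fs (ignore: %zu %zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, p, sum);
    free(next);
    return 0;
}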
For some actual measurements, see https://www.7-cpu.com/cpu/Skylake.html for example, for Skylake-client i7-6700 with dual-channel DDR4-2400 CL15.
Intel "server" chips, big Xeons, have significantly higher memory latency, enough that it seriously reduces the memory (and L3) bandwidth available to a single core even if the others are idle. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
Although I haven't heard if this has improved much with Ice Lake-server or Sapphire Rapids; it was quite bad when they first switched to a mesh interconnect (and non-inclusive L3) in Skylake-server.

Is CPU speed limited by the speed of fetching instructions from memory?

When learning assembly I realized that I should put frequently accessed data in registers instead of memory, because memory is much slower.
The question is, how can the CPU run faster than memory when the instructions themselves are fetched from memory in the first place? Does the CPU usually spend a lot of time waiting for instructions from memory?
EDIT:
To run a program, we need to compile it to a file containing machine codes. Then we load that file into memory, and run one instruction after another. The CPU needs to know what instruction to run, and that piece of information is fetched from memory. I'm not asking about manipulating data but about the process of reading the instructions from memory. Sorry if I wasn't clear enough.
EDIT 2:
Example: xor eax, eax compiles to 31c0 on my computer. I know this instruction itself is fast. But to clear eax, the CPU needs to read 31c0 from memory first. That read should take a long time if accessing memory is slow, so does the CPU just stall for that period?
Code fetch in parallel with instruction execution is so critical that even 8086 did it (to a limited extent, with a very small prefetch buffer and low bandwidth). Even so, code fetch bandwidth actually was THE major bottleneck for 8086.
(I just realized you didn't tag this x86, although you did use an x86 instruction as an example. All my examples are x86, but the basics are pretty much the same for any other architecture, except that non-x86 CPUs won't use a decoded-uop cache; x86 is the only ISA still in common use that's so hard to decode that it's worth caching the decode results.)
In modern CPUs, code-fetch is rarely a bottleneck because caches and prefetching hide the latency, and bandwidth requirements are usually low compared to the bandwidth required for data. (Bloated code with a very large code footprint can run into slowdowns from instruction-cache misses, though, leading to stalls in the front-end.)
L1I cache is separate from L1D cache, and CPUs fetch/decode a block of at least 16 bytes of x86 code per cycle. CPUs with a decoded-uop cache (Intel Sandybridge family, and AMD Ryzen) even cache already-decoded instructions to remove decode bottlenecks.
See http://www.realworldtech.com/sandy-bridge/3/ for a fairly detailed write-up of the front-end in Intel Sandybridge (fetch/pre-decode/decode/rename+issue), with block diagrams comparing the instruction-fetch logic of Intel Sandybridge, Intel Nehalem, and AMD Bulldozer. (Decode is covered on the next page.) The "pre-decode" stage finds instruction boundaries (i.e. determines instruction lengths ahead of decoding what each instruction actually is).
L1I cache misses result in a request to the unified L2. Modern x86 CPUs also have a shared L3 cache (shared between multiple cores).
Hardware prefetching brings soon-to-be-needed code into L2 and L1I, just like data prefetching into L2 and L1D. This hides the > 200 cycle latency to DRAM most of the time, usually only failing on jumps to "cold" functions. It can almost always stay ahead of decode/execute when running a long sequence of code with no taken branches, unless something else (like data loads/stores) is using up all the memory bandwidth.
You could construct some code that decodes at 16 bytes per cycle, which may be higher than main-memory bandwidth. Or maybe even higher on an AMD CPU. But usually decode bottlenecks will limit you more than pure code-fetch bandwidth.
See also Agner Fog's microarch guide for more about the front-end in various microarchitectures, and optimizing asm for them.
See also other CPU performance links in the x86 tag wiki.
If you have frequently accessed data, chances are that you also have the same instructions repeatedly processing it. An efficient CPU will not fetch the same instructions again and again from slow memory. Instead, they are kept in an instruction cache with a very short access time. Therefore, the CPU generally doesn't need to wait for instructions.
The memory is very slow compared to the CPU. Fetching data from RAM costs roughly 200 clock cycles, so in general it is very important for performance to write cache friendly code. And yes, the CPU spends a lot of time waiting for data.
Why is this the case? Well, they are simply different kinds of memory. In general it is more expensive to build fast memory, so to keep costs down, the fastest memory is reserved for the registers. Physical distance limits speed too: memory you want to access quickly needs to be close to the core. Light travels at around 300 000 km/s, which is about 30 cm per nanosecond, and electrical signals on a chip or motherboard move noticeably slower than that. With RAM sitting roughly 10 cm from the core, the round trip alone already costs a meaningful fraction of a nanosecond even at the speed of light, and the DRAM array itself needs tens of nanoseconds to deliver the data. Modern CPUs run at several GHz, so a clock cycle is well under a nanosecond; we have long since reached the point where it is physically impossible for main memory to keep up with the CPU.
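A quick sanity check on those numbers (assuming signals at the vacuum speed of light, which is optimistic, since signals on a chip or a motherboard trace travel noticeably slower, and ignoring the DRAM access time itself):

#include <stdio.h>

int main(void)
{
    const double c_m_per_s  = 3.0e8;   /* speed of light: an upper bound      */
    const double ram_dist_m = 0.10;    /* assume RAM ~10 cm from the core     */
    const double clock_hz   = 4.0e9;   /* assume a 4 GHz core as an example   */

    double round_trip_ns = 2.0 * ram_dist_m / c_m_per_s * 1e9;  /* ~0.67 ns   */
    double cycle_ns      = 1e9 / clock_hz;                      /* 0.25 ns    */

    printf("best-case round trip: %.2f ns = %.1f cycles at 4 GHz\n",
           round_trip_ns, round_trip_ns / cycle_ns);
    return 0;
}

Even this best case already costs a couple of cycles, and real signal speeds plus the DRAM access time push the total to tens of nanoseconds.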
However, this physical limitation (theory of relativity) only affects access time and not bandwidth. So when you fetch data at address addr it does not cost anything extra to also fetch addr+1.
Between the registers and RAM you have the cache. In a modern computer there are typically three levels of cache. This works similarly to how data from a hard drive is cached in RAM: when you read a bit of data, it is likely that you will need the surrounding data soon, so the surrounding data is read at the same time and stored in the cache. When you ask for the next piece of data, it is likely to be in the cache already. Whenever you request something from memory, there are circuits that check whether that piece of memory is already in the cache or not.
You cannot control the cache directly. What you can do is to write cache friendly code. This can be tricky for advanced cases, but in general, the trick is to not jump around large distances in memory. Try to access the memory sequentially.
Here is a simple example of cache friendly code:
int *squareMatrix = malloc(SIZE * SIZE * sizeof(*squareMatrix));
int sum = 0;
for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
        sum += squareMatrix[i*SIZE + j];   /* row-major order: consecutive addresses */
And a non cache friendly version:
int *squareMatrix = malloc(SIZE * SIZE * sizeof(*squareMatrix));
int sum = 0;
for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
        sum += squareMatrix[j*SIZE + i];   /* column-major order: each access jumps SIZE ints ahead */
The difference is [j*SIZE+i] vs [i*SIZE+j]. The first version reads the whole matrix sequentially, greatly increasing the chance that the next element will already be in the cache when you ask for it.
Here is the difference of the above code on my computer with SIZE=30000:
$ time ./fast
real 0m2.755s
user 0m2.516s
sys 0m0.236s
$ time ./slow
real 0m18.609s
user 0m18.268s
sys 0m0.340s
As you can see, this can affect performance significantly.
Typical access times and sizes for different types of memory. Very approximate, and just to give a general idea of it:
Memory type        | Clock ticks | Size
===================|=============|==============================
register           |           1 | 8 B each, around 128 B total
level1 cache       |           5 | 32 kB
level2 cache       |          10 | 1 MB
level3 cache       |          50 | 20 MB
RAM                |         200 | 16 GB
SSD drive          |      10,000 | 500 GB
Mechanical drive   |   1,000,000 | 4 TB
It could also be mentioned that the level1 cache is typically split into data and code.

hit ratio in cache - reading long sequence of bytes

Let's assume that one cache line has a size of 2^n bytes. What hit ratio should be expected when sequentially reading a long contiguous block of memory byte by byte?
To my eye it is (2^n - 1) / 2^n.
However, I am not sure if I am right. What do you think?
Yes, that looks right for simple hardware (non-pipelined, with no prefetching): e.g. 1 miss and 63 hits for 64-byte cache lines.
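Here is a tiny simulation of that counting argument (assuming the simple-hardware model above: 2^n-byte lines with n = 6, no prefetching, purely sequential byte reads, and only the most recently touched line cached):

#include <stdio.h>

int main(void)
{
    const unsigned long line_size   = 64;        /* 2^n bytes per line, n = 6    */
    const unsigned long total_bytes = 1ul << 20; /* read 1 MB byte by byte       */

    unsigned long hits = 0, misses = 0;
    long cached_line = -1;                       /* which line is currently held */

    for (unsigned long addr = 0; addr < total_bytes; addr++) {
        long line = (long)(addr / line_size);
        if (line == cached_line) hits++;
        else { misses++; cached_line = line; }   /* first byte of a new line misses */
    }
    printf("hit ratio = %f\n", (double)hits / (hits + misses));
    return 0;
}

It prints 0.984375, i.e. (2^6 - 1) / 2^6 = 63/64, matching the formula.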
On real hardware, even in-order single-issue (non-superscalar) CPUs, miss under miss (multiple outstanding misses) is usually supported, so you will see misses until you run out of load buffers. This makes memory accesses pipelined as well, which is useful when misses to different cache lines can be in flight at once, instead of waiting the full latency for each one.
Real hardware will also have hardware prefetching. For example, have a look at Intel's article about disabling HW prefetching for some use-cases.
HW prefetching can probably keep up with a one-byte-at-a-time loop on most CPUs, so with good prefetching you might see hardly any L1 cache misses.
See Ulrich Drepper's What Every Programmer Should Know About Memory, and other links in the x86 tag wiki for more about real HW performance.

Operations that use only RAM

Can you please give me some example code that uses a negligible amount of CPU and storage but makes heavy use of RAM? For example, if I run a loop and create objects, this will consume RAM but not CPU or storage. In other words, I'm asking for some memory-expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with the CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in the cache. The memory controller does this directly, without CPU involvement (so that the CPU can work on another process instead of waiting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching, however. The principle of locality comes into play: CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and make it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant who anticipates which tools will be needed and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
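A sketch of the kind of code this describes (the node count and the use of plain malloc are just illustrative; how scattered the nodes end up depends on the allocator, so real behaviour varies):

#include <stdio.h>
#include <stdlib.h>

struct node { long value; struct node *next; };

int main(void)
{
    const long n = 10 * 1000 * 1000;   /* ~10M nodes, a few hundred MB of heap */

    /* Many separate allocations: node addresses can end up scattered across
       the heap, so the traversal below is hard for prefetchers to predict. */
    struct node *head = NULL;
    for (long i = 0; i < n; i++) {
        struct node *nd = malloc(sizeof *nd);
        if (!nd) return 1;
        nd->value = i;
        nd->next = head;
        head = nd;
    }

    long sum = 0;
    for (struct node *p = head; p; p = p->next)   /* pointer chase through RAM */
        sum += p->value;

    printf("%ld\n", sum);
    return 0;
}

Traversing the list is one long chain of dependent loads to hard-to-predict addresses, which is about as cache-hostile (and therefore RAM-heavy) as it gets; summing the same values from a plain contiguous array would instead be served almost entirely from cache thanks to prefetching.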
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift, the Int64 type requires 64 bits of memory, so if you allocate space for 1,000,000 Int64 values you will reserve about 8 MB of memory.
UnsafeMutablePointer<Int64>.allocate(capacity: 1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.
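One caveat worth adding (assuming a typical OS with demand paging, e.g. Linux or macOS): pages that are only reserved but never written may not occupy physical RAM at all, so to really exercise RAM you have to touch the memory. A minimal C sketch of the same idea:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t count = 1000000;                 /* one million 64-bit integers      */
    long long *buf = malloc(count * sizeof *buf); /* reserves ~8 MB of address space  */
    if (!buf) return 1;

    /* With demand paging, physical RAM is typically only consumed once the
       pages are written, so touch every byte to actually occupy the ~8 MB. */
    memset(buf, 0, count * sizeof *buf);

    printf("%lld\n", buf[count - 1]);             /* keep the buffer "used"           */
    free(buf);
    return 0;
}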

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this, but I've searched and always found different answers. My question is:
Which is faster? Cache or CPU Registers?
As I understand it, the registers are what the CPU loads data into in order to operate on it, while the cache is just a storage area close to or inside the CPU.
Here are the sources I found that confuses me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
A CPU register is always faster than the L1 cache. It is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower, because if you make something bigger then inevitably the distances get larger. For a while this was automatic: shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Write-back caches have been prominent in desktop chip design since about 2001, but were widely missing from the ARM-based mobile chip market until much more recently.
x86 chips these days have store forwarding from the write-back cache. If you store to an address and then read back the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle; in fact, the L1 isn't referenced at all in that case. Store forwarding also has some other rules for it to work properly, which vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matching operand size).
The store forwarding feature can generate false positives, wherein the CPU thinks the address is in the write-back buffer based on a fast partial-bits check (usually 10-14 bits, depending on the chip). It uses an extra cycle to verify with a full check. If that fails, the CPU must re-route the operation as a regular memory request. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series.
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
Then cross-reference that with the tables of architectures, models, and release dates on the various "List of CPUs" pages on Wikipedia.
