Why are CPU caches faster than RAM? - cpu

Can the same technology/idea that is used to make caches fast not be used to make RAM fast too?
Or is there a fundamental tradeoff to be made here between speed of access vs. size of the store.

Cost. Cache is SRAM and main memory is DRAM. Different technologies, different use cases. You couldn't afford a computer that used SRAM for (a reasonable amount of) main memory.

The Distance: CPU is much more tightly packed than a RAM and therefore electric signals need to travel a lot longer.
How they work:
RAM uses a small capacitor that’s connected to the transistor which stores a tiny charge. This charge will not stay long and it needs to be constantly refreshed to maintain its state which makes it slower.
CPU Cache on the other hand uses a flip-flop gate which retains its state without refreshing. It’s more expensive to develop but it’s much faster.

Related

Need help understanding a Cache concept

I've been reading a lot about operating systems and all lately. I understand how cache works and what is it used for.
However, when I asked a question to myself, I was unable to find an answer.
If a cache can be made as large as the device it is caching (for instance, a cache as large as a disk), why not do so and eliminate the device.
Assuming the cache is in-memory, it's not persistent, meaning you'll lose it once your machine is reboot. If nothing else, you'll need persistent storage (disk) to avoid losing data on restarts and power loses.
For a disk cache, there's no technical limit why a battery-backed RAM cache or flash storage usually used for HDD cache can't be combined together to provide terabytes of capacity. Your average user won't be using those RAM-based storage products because it's far more expensive than normal storage and they still consume too much power to leave unplugged for hours. We are leaving SSHD behind for full flash storage, but even they have RAM cache to reduce & consolidate IO calls.
For CPU cache, there's actually a practical limit to it, they're taking precious physical space that can be used for other unit, increasing distance (thus latency), and generate heat (which put a hard limit no matter how large is your cooling budget). That's why CPU cache is multilevel instead, with the small L1 tightly coupled to a core, the bigger L2 shared by multiple cores, and the slower but even bigger L3 with the slower but cheaper component. Even with an unlimited budget and an exotic process, a multilevel cache CPU will have better performance than a single-level cache CPU that's forced to put everything on (relatively) far silicon and suffering from the latency.

What are the trade-offs of larger cache memories ? Could we use one to replace secondary memory?

What are the disadvantages of using larger cache memories? Could we use just use a large enough cache memory so a secondary memory wouldn't be needed at all? I understand that the most compelling arguments are related to the cost of it / the problem of it's size. But if we assume that creating such a cache memory is possible, would it encounter any additional problems?
Many problems even if it was not expensive
Size will degrade the performance
Cache is fast because it’s very small compared to the main memory and hence it requires small amount of time to search it. If you build a large cache then it will not be able to perform at the same speed as the smaller counterpart.
Larger die area
Most of the DRAM chips only require a capacitor and a transistor to store a bit. SRAM on the other hand requires at least 6 transistors to make a single cell of memory. Which requires more area.
High power requirements
Because of the more transistors SRAM requires more power to operate. Which in turn generates more heat so you will have to handle the cooling problem.
So as you can see it’s not worth the effort given that today’s computers already achieve 90% hit ratio most of the time.

Operations that use only RAM

Can you please tell me some example code where we use ignorable amount of CPU and storage but heavy use of RAM? Like, if I run a loop and create objects, this will consume RAM but not CPU or storage. I mean tell me some memory expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while wasting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and makes it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who preempts what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift the Int64 type requires 64 bit of memory. So if you allocate space for 1000000 Int64 you will reserve memory for 8 MB.
UnsafeMutablePointer<Int64>.alloc(1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.

DRAM and its effects on real world performance

After learning a little on how computer programs run I had some thoughts concerning the cpu and RAM. After watching a few youtube videos (linus tech tips and others) they all seem to show that increasing a RAM speed (frequency) does not really have much of a performance improvement in real world applications and games on a general desktop computer. My first question is why is this? Is it because of the high hit rates (95% and above) of the cpu's cache on most modern cpus? Which in turn would lead to less and less need for the cpu to reach out to ram? Also, in which situations would faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary very differently depending on the architecture/OS running the system. I am answering from a best-judgement standpoint on how these things generally work
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the RAM of the computer matters less than the clock speed of the CPU cache. Because:
the CPU gets its instructions from the cache, not straight from RAM
with the larger cache sizes of modern CPU's, it is less necessary to need to go out to RAM as often.
When the cache needs to go out to RAM, it uses an asynchronous processor (DMA) to grab more information, allowing the CPU to switch to a different process entirely.
Besides that, the clock speed of the motherboard's various pipelines (DMA) could be creating a chokepoint where it is slowing the transfer rate of the information overall.
which situations would faster RAM frequency be beneficial?
I would say that overall, any one of the core pieces of hardware involved with memory and its use and transfer (CPU, CPU Cache, The Various memory pipelines, the various memory transfer devices (DMA, etc.), the RAM itself) can cause a chokepoint where faster RAM might or might not affect the overall performance. It is really a by-case issue.

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers
an amazing 10 gigabits per second of transfer speeds in both
directions
1.25 GB/s sounds nice but most processors of the day are clocked around 2 Ghz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These, at least for conventional software running on a CPU, are dependent on the processor clock, but not only. Writing your code to take advantage of the hardware facilities like caches, instruction-level parallelism, etc. is still a black art but can give you an order of magnitude performance boost.
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, harddisk controllers will decide the harddisk I/O performance, graphics cards will decide maximum resolution and refresh I/O performance, and so on. Don't really understand the question, the CPU is becoming less and less involved in these kinds of things (well, has been for the last 10 years).
I doubt the question will even have bearing on CPUs with integrated GPUs, since the buffer to be output to screen is in external memory sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is, that supports the kinds of bus speeds required by the Thunderbolt chip set version that is in the machine in question.
Thunderbolt is a raw data traffic system and performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free intelligent data shuffling doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when it needs it. For the I/O, the bottleneck will become the read/write speed of your harddisk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

Resources