Memory performance vs Utilization

Memory performance vs Utilization - performance

Question: How is memory (RAM) performance (read/write speeds, etc.) affected by total utilization.
Background:
I am curious if there is a performance impact for reading/writing to system memory based on the overall utilization of that memory
If performance degrades at higher utilization, what is the relationship between utilization and performance? Is this linear? Or at some point is there a significant drop in performance?
If there is a drop in performance with higher utilization, is there a point at which it becomes faster to use swap on an SSD on a SATA bus? Where does this point occur?
Outcome:
All else being equal, I'm curious if there should be a specific target for memory utilization to get the best performance from a machine as on the one hand, having more stuff in system memory should make things faster than having to read from disk, but at some point surely, the overall memory performance is materially affected by some overhead from high memory utilization right?

This sounds a bit like a superuser.com question, not stackoverflow.
Time to allocate new memory might increase slightly as the system approaches 100% full.
If you don't have any swap space, Linux will pick a processes using a lot of RAM and kill it pre-emptively when the system is approaching OOM. (google oom-killer.)
Access time to already allocated memory is not at all influenced by the fraction of total memory in use. A program that uses 1GB of memory with some specific access pattern will show the same performance on a machine with 2G vs. a machine with 16GB of RAM.
Virtual->physical mappings are defined by page tables, which by themselves could give slower performance for lookups when more memory is allocated to a process. (each process has its own page table). Again, this is not %-full dependent, simply size. However, these lookups need to be cached by the CPU hardware TLB.
See Ulrich Drepper's What Every Programmer Should Know About Memory for more background on this stuff.

Related

Need help understanding a Cache concept

I've been reading a lot about operating systems and all lately. I understand how cache works and what is it used for.
However, when I asked a question to myself, I was unable to find an answer.
If a cache can be made as large as the device it is caching (for instance, a cache as large as a disk), why not do so and eliminate the device.

Assuming the cache is in-memory, it's not persistent, meaning you'll lose it once your machine is reboot. If nothing else, you'll need persistent storage (disk) to avoid losing data on restarts and power loses.

For a disk cache, there's no technical limit why a battery-backed RAM cache or flash storage usually used for HDD cache can't be combined together to provide terabytes of capacity. Your average user won't be using those RAM-based storage products because it's far more expensive than normal storage and they still consume too much power to leave unplugged for hours. We are leaving SSHD behind for full flash storage, but even they have RAM cache to reduce & consolidate IO calls.
For CPU cache, there's actually a practical limit to it, they're taking precious physical space that can be used for other unit, increasing distance (thus latency), and generate heat (which put a hard limit no matter how large is your cooling budget). That's why CPU cache is multilevel instead, with the small L1 tightly coupled to a core, the bigger L2 shared by multiple cores, and the slower but even bigger L3 with the slower but cheaper component. Even with an unlimited budget and an exotic process, a multilevel cache CPU will have better performance than a single-level cache CPU that's forced to put everything on (relatively) far silicon and suffering from the latency.

Does the NT Kernel not utilise Multiple Memory Channel Architecture?

I've been reading benchmarks that test the benefits of systems with Multiple Memory Channel Architectures. The general conclusion of most of these benchmarks is that the performance benefits of systems with greater numbers of memory channels over those systems with fewer channels are negligible.
However nowhere have I found an explanation of why this is the case, just benchmark results indicating that this is the real world performance attained.
The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain, however in real world applications the gains are negligible. Why?
My postulation is that when the NT Kernel allocates physical memory it is not disturbing the allocations evenly across the the memory channels. If all of a process's virtual memory is mapped to a single memory channel within a MMC system then the process will effectively only be able to attain the performance of having a single memory channel at its disposal. Is this the reason for negligible real world performance gains?
Naturally a process is allocated virtual memory and the kernel allocates physical memory pages, so is this negligible performance gain the fault of the NT Kernel not distributing allocations across the available channels?

related: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? two memory controllers is sufficient for single-threaded memory bandwidth. Only if you have multiple threads / processes that all miss in cache a lot do you start to benefit from the extra memory controllers in a big Xeon.
(e.g. your example from comments of running many independent image-processing tasks on different images in parallel might do it, depending on the task.)
Going from two down to one DDR4 channel could hurt even a single-threaded program on a quad-core if it was bottlenecked on DRAM bandwidth a lot of the time, but one important part of tuning for performance is to optimize for data reuse so you get at least L3 cache hits.
Matrix multiplication is a classic example: instead of looping over rows / columns of the whole matrix N^2 times (which is too big to fit in cache) (one row x column dot product for each output element), you break the work up into "tiles" and compute partial results, so you're looping repeatedly over a tile of the matrix that stays hot in L1d or L2 cache. (And you hopefully bottleneck on FP ALU throughput, running FMA instructions, not memory at all, because matmul takes O(N^3) multiply+add operations over N^2 elements for a square matrix.) These optimizations are called "loop tiling" or "cache blocking".
So well-optimized code that touches a lot of memory can often get enough work done as its looping that it doesn't actually bottleneck on DRAM bandwidth (L3 cache miss) most of the time.
If a single channel of DRAM is enough to keep up with hardware prefetch requests for how quickly/slowly the code is actually touching new memory, there won't be any measureable slowdown from memory bandwidth. (Of course that's not always possible, and sometimes you do loop over a big array doing not very much work or even just copying it, but if that only makes up a small fraction of the total run time then it's still not significant.)

The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain, however in real world applications the gains are negligible. Why?
Think of it as a hierarchy, like "CPU <-> L1 cache <-> L2 cache <-> L3 cache <-> RAM <-> swap space". RAM bandwidth only matters when L3 cache wasn't big enough (and swap space bandwidth only matters if RAM wasn't big enough, and ...).
For most (not all) real world applications, the cache is big enough, so RAM bandwidth isn't important and the gains (of multi-channel) are negligible.
My postulation is that when the NT Kernel allocates physical memory it is not disturbing the allocations evenly across the the memory channels.
It doesn't work like that. The CPU mostly only works with whole cache lines (e.g. 64 byte pieces); and with one channel the entire cache line comes from one channel; and with 2 channels half of a cache line comes from one channel and the other half comes from a different channel. There is almost nothing that any software can do that will make any difference. The NT kernel only works with whole pages (e.g. 4 KiB pieces), so whatever the kernel does is even less likely to matter (until you start thinking about NUMA optimizations, which is a completely different thing).

Operations that use only RAM

Can you please tell me some example code where we use ignorable amount of CPU and storage but heavy use of RAM? Like, if I run a loop and create objects, this will consume RAM but not CPU or storage. I mean tell me some memory expensive operations.

appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while wasting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and makes it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who preempts what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.

In Swift the Int64 type requires 64 bit of memory. So if you allocate space for 1000000 Int64 you will reserve memory for 8 MB.
UnsafeMutablePointer<Int64>.alloc(1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.

DRAM and its effects on real world performance

After learning a little on how computer programs run I had some thoughts concerning the cpu and RAM. After watching a few youtube videos (linus tech tips and others) they all seem to show that increasing a RAM speed (frequency) does not really have much of a performance improvement in real world applications and games on a general desktop computer. My first question is why is this? Is it because of the high hit rates (95% and above) of the cpu's cache on most modern cpus? Which in turn would lead to less and less need for the cpu to reach out to ram? Also, in which situations would faster RAM frequency be beneficial?

NOTE: this is a very broad question, and the answers can vary very differently depending on the architecture/OS running the system. I am answering from a best-judgement standpoint on how these things generally work
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the RAM of the computer matters less than the clock speed of the CPU cache. Because:
the CPU gets its instructions from the cache, not straight from RAM
with the larger cache sizes of modern CPU's, it is less necessary to need to go out to RAM as often.
When the cache needs to go out to RAM, it uses an asynchronous processor (DMA) to grab more information, allowing the CPU to switch to a different process entirely.
Besides that, the clock speed of the motherboard's various pipelines (DMA) could be creating a chokepoint where it is slowing the transfer rate of the information overall.
which situations would faster RAM frequency be beneficial?
I would say that overall, any one of the core pieces of hardware involved with memory and its use and transfer (CPU, CPU Cache, The Various memory pipelines, the various memory transfer devices (DMA, etc.), the RAM itself) can cause a chokepoint where faster RAM might or might not affect the overall performance. It is really a by-case issue.

CPU Utilization and threads

We have a transaction intensive process at one customer site running on a quad core server with four processors. The process is designed to take advantage of every core available. So in this installation, we take an input queue, divide it by 16th's and allocate each fraction of the queue to a core. It works well and keeps up with the transaction volume on the box.
Looking at the CPU utilization on the box, it never seems to go above 33%. Now we have a new customer with at least double the volume of the existing customer. Some of us are arguing that since CPU usage is way below maximum utilization, that we should go with the same configuration.
Others claim that there is no direct correlation between cpu utilization and transaction processing speed and since the logic of the underlying software module is based on the number of available cores, that it makes sense to obtain a box with proportionately more cores available for the new client to accommodate the increased traffic volume.
Does anyone have a sense as to who is right in this instance?
Thank you,

To determine the optimum configuration for your new customer, understanding the reason for low CPU usage is paramount.
Very likely, the reason is one of the following:
Your process is limited by memory bandwidth. In this case, faster RAM will help if supported by the motherboard. If possible, a redesign to limit the amount of data accessed during processing will improve performance. Adding more CPU cores will, on its own, do nothing to improve performance.
Your process is limited by disk I/O. Using faster disk connections (SATA etc.) and/or upgrading to a SSD might help, but more CPU power will not.
Your process is limited by synchronization contention. In this case, adding more threads for more cores might even be counter productive. Redesigning your algorithm might help in this case.
Having said this, I have also seen situations where processes that are definitely CPU bound fail to achieve 100% CPU usage on modern processors (Core i7 etc.) because in certain turbo boost relevant cases, task manager will show less than 100%.
As 9000 said, you need to find out what your bottlenecks are when under load. Perfmon might provide enough data to find out.
Another afterthought: You could limit your process on the existing machine to part of the cores (but still at least 30% so that theoretically, CPU doesn't become a bottleneck due to this limitation) and check if overall throughput degrades. If it does not, adding more cores will not improve performance.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio