What are good tools to analyze DRAM access patterns? - memory-management

I want to analyze the memory access patterns of a program at the DRAM level (not the CPU caches): for instance, the DRAM accesses performed by the program, DRAM hits, DRAM misses, etc. Is there any tool that can provide such information?
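On Linux, hardware performance counters are a common way to get this kind of data: last-level-cache miss counts are a rough proxy for DRAM accesses, and tools such as perf, Intel VTune, or Valgrind's Cachegrind (a cache simulator) are typically used for this. As a minimal illustrative sketch (not a complete profiler; counter accuracy and availability vary by CPU), the raw perf_event_open(2) syscall can count LLC read misses around a region of code:

// Count last-level-cache read misses (a rough proxy for DRAM accesses) around
// a region of code, using the Linux perf_event_open(2) syscall directly.
// Illustrative sketch only; real tools (perf, VTune, Cachegrind) are more robust.
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    // LLC read misses: these normally have to be served by DRAM.
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    enum { N = 1 << 24 };                     // 16M ints = 64 MiB, larger than the LLC
    int *buf = malloc(N * sizeof(int));
    if (!buf) return 1;
    memset(buf, 1, N * sizeof(int));          // touch every page before counting

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    long long sum = 0;
    for (int i = 0; i < N; i++)               // region of interest: stream over 64 MiB
        sum += buf[i];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("LLC read misses: %llu (sum=%lld)\n", (unsigned long long)misses, sum);
    return 0;
}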

Related

Policy for writing to memory

I understand that when data needs to be delivered to the CPU:
On a cache miss we access the cache, then access the DRAM:
a) if it is a DRAM hit, we copy the data from the DRAM back into the cache;
b) if it is a DRAM miss, we copy the data from the disk to the DRAM and then from the DRAM to the cache.
On a cache hit we just access the cache.
What is the policy that we should use when we write to memory?
For example:
On every write cache hit, do we update the cache, the DRAM, and the disk?
On every write miss, do we write to the disk, read that disk block into DRAM,
and then read the DRAM block into the cache?
Most modern CPUs have caches so much faster than DRAM that write-back is the only policy that makes sense. Some older CPUs, or modern embedded chips, may have write-through CPU caches when the gap between on-chip cache and DRAM isn't so huge. Either way this is hardware-managed and invisible to software.
But writes always stop at DRAM, if/when they make it that far. The "backing store" on disk is not important when the page is in DRAM. If you want to think about DRAM as cache for a memory-mapped file (or the pagefile for anonymous memory), the only write policy that makes sense for performance is write-back!
Write-back to disk is managed by software, so implementing a write-through policy would require making every store trap to the OS after committing to DRAM, at which point the OS would have to run a bunch of code to initiate a SATA write command of the whole page. (And would have to do this without accessing any DRAM itself, otherwise how would those writes get in sync on disk? Or maybe you'd let yourself off the hook here because kernel memory is generally not pageable, so this kernel code is only backed by DRAM, not ultimately by disk pages.)
Even if disk-write was efficiently possible with byte or word granularity (which it very much isn't unless your "disk" is actually non-volatile RAM like 3D XPoint (e.g. Optane DC Persistent Memory), or battery-backed DRAM), just trapping every store would still destroy performance, like hundreds of times slower.
The gap between DRAM and disk has always been huge; hardware doesn't have mechanisms to make efficient write-through to "disk" possible, other than with modern non-volatile storage connected to the memory bus so it can be truly memory-mapped, like Linux mmap(MAP_SYNC). But then there's no plain DRAM in between the CPU cache and the persistent NV-DRAM.
I/O vs. DRAM performance: random DRAM writes (on a modern x86, using cache-bypassing NT stores) take something like ~60ns with 64-byte granularity (for a burst write of a full cache line), including time spent getting the store from a CPU core to a memory controller. (60ns is actually something like the L3-miss load-use latency for reads but I'm going to assume something similar for NT stores.)
Random disk writes to a rotational magnetic disk take about 10ms, so that's more than five orders of magnitude slower (10 ms vs. ~60 ns is roughly a 170,000x ratio).
Also, disk writes have a minimum size of usually 512 or 4096 bytes (1 hardware sector), so to write 1 byte or word, or a CPU cache line, would take a read-modify-write cycle for the disk.

Can multiple CPUs access memory simultaneously in a common home computer?

As far as I know, in a modern multi-core CPU system, the different CPUs share one memory bus. Does that mean only one CPU can access memory at any given moment, since there is only one memory bus and it cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory buses, it's normal for them to be half-duplex (i.e. either loading or storing, not sending data in both directions at once like gigabit ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
First of all, remember that when a CPU core "accesses the memory", it doesn't have to read directly from DRAM. The cache maintains a coherent view of memory shared by all cores, using (a variant of) the MESI cache coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one on each channel.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
I know that modern CPUs use caches to achieve low latency. So my question is about the scenario where the computer has just started: there is no data in the cache yet, so the CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
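For example, a rough way to establish that baseline (an illustrative sketch, not a calibrated benchmark like STREAM; the 512 MiB buffer size is an arbitrary choice) is to time a single thread streaming through a buffer much larger than the last-level cache, then run several copies, or a threaded variant, and see how the aggregate scales:

// Rough single-thread read-bandwidth sketch: stream through a buffer much
// larger than the last-level cache and time it. For serious numbers use a
// tuned benchmark such as STREAM; this only illustrates the idea.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    const size_t bytes = 512u << 20;                 // 512 MiB, far larger than any LLC
    uint64_t *buf = malloc(bytes);
    if (!buf) return 1;
    memset(buf, 0xA5, bytes);                        // touch every page first

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint64_t sum = 0;
    size_t n = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < n; i++)
        sum += buf[i];                               // sequential read of the whole buffer

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("read %.0f MiB in %.3f s -> %.2f GiB/s (sum=%llu)\n",
           bytes / (1024.0 * 1024.0), secs,
           bytes / secs / (1024.0 * 1024.0 * 1024.0),
           (unsigned long long)sum);
    return 0;
}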
Now I'll explain further.
It all depends on the architecture you're using; for now, let's say modern SMP and SDRAM:
1) If two cores tried to access the same address in RAM, it could go several ways:
If they both want to read, simultaneously: two cores on the same chip will probably share an intermediate cache at some level (2 or 3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready. Two cores on different chips may not share a cache, but still need to co-ordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.
If they both want to write: two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM. Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip 1, fetched into chip 2's cache, modified and then written back again (no idea whether the write/fetch can be coalesced by snooping).
2) If two cores tried to access different addresses
For a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.

Operations that use only RAM

Can you please give me some example code that uses a negligible amount of CPU and storage but makes heavy use of RAM? For example, if I run a loop and create objects, this will consume RAM but not much CPU or storage. In other words, tell me about some memory-expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with the CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while waiting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching, however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and make it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who anticipates what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
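As a small illustration of that difference (a hypothetical sketch, not code from the question), compare chasing pointers through nodes linked in a shuffled order with a plain sequential pass over the same pool of nodes:

// Pointer-chasing through nodes linked in shuffled order defeats the hardware
// prefetcher (each ->next address is unpredictable), while a sequential pass
// over the same pool prefetches well. Illustrative sketch only.
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; long long value; };

int main(void)
{
    enum { N = 1 << 22 };                        // ~4M nodes, ~64 MiB of node data
    struct node *pool = malloc(N * sizeof *pool);
    size_t *order = malloc(N * sizeof *order);
    if (!pool || !order) return 1;

    // Build a random visiting order, then link the nodes in that order so each
    // ->next points somewhere "far away" in memory.
    for (size_t i = 0; i < N; i++) order[i] = i;
    for (size_t i = N - 1; i > 0; i--) {         // Fisher-Yates shuffle
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < N; i++) {
        pool[order[i]].value = (long long)i;
        pool[order[i]].next = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
    }

    // Cache-unfriendly: every step is a dependent load from a random address.
    long long sum_list = 0;
    for (struct node *p = &pool[order[0]]; p; p = p->next)
        sum_list += p->value;

    // Cache-friendly: the same data, read in memory order.
    long long sum_array = 0;
    for (size_t i = 0; i < N; i++)
        sum_array += pool[i].value;

    printf("%lld %lld\n", sum_list, sum_array);  // both sums are equal
    return 0;
}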
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift the Int64 type requires 64 bits of memory, so if you allocate space for 1,000,000 Int64 values you will reserve about 8 MB of memory.
UnsafeMutablePointer<Int64>.allocate(capacity: 1_000_000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.

DRAM cache miss

I read a paragraph about DRAM (main memory) cache misses and SRAM (L1, L2, L3) cache misses, and I am not sure what it means.
Since DRAM is slower than SRAM, the cost for cache misses is expensive because DRAM cache misses are served from disk, while SRAM cache misses are usually served from DRAM-based main memory.
Here is my understanding :
If there is a cache miss in DRAM, it goes to disk (secondary memory) to find the datum, while if there is a cache miss in SRAM, it goes to DRAM to find the datum.
Could you tell me if I am right or wrong?
In general, if there's a miss at level L, you have to go one level further down, L+1.
A typical memory hierarchy comprises the following levels, from 0 onwards:
Processor registers
Processor caches (SRAM)
System memory (DRAM)
Mass storage (Flash/Spinning devices)
If you want to store something in a local register, you have to first fetch it from memory.
If your data is in one of the caches of the processor (SRAM), you don't need to go further down. If you have a cache miss however, you have to go to system memory (DRAM).
What happens here is that you might try to access a memory page which is not in memory, either because it has never been loaded or because at some point it has been swapped out. You have a page fault and you need to fetch your page from storage devices. This process stops as soon as you find your data.
Note that you want to avoid as much as possible access to slow storage drives, so what you can do is create additional caching layers between DRAM and spinning disks by means of faster devices, e.g. SSDs (ZFS L2ARC, bcache, etc.).

What is locality of reference?

I am having a problem understanding locality of reference. Can anyone please help me understand what it means and what the following are:
Spatial Locality of reference
Temporal Locality of reference
This would not matter if your computer was filled with super-fast memory.
But unfortunately that's not the case and computer-memory looks something like this1:
+----------+
| CPU | <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache| <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 - Cache| <<-- ~12 CPU-cycles access latency (fast)
+----+-----+
|
+----------+
|L3 - Cache| <<-- ~35 CPU-cycles access latency (medium)
+----+-----+ (usually shared between CPU-cores)
|
| <<-- This thin wire is the memory bus, it has limited bandwidth.
+----+-----+
| main-mem | <<-- ~100 CPU-cycles access latency (slow)
+----+-----+ <<-- The main memory is big but slow (because we are cheap-skates)
|
| <<-- Even slower wire to the harddisk
+----+-----+
| harddisk | <<-- Works at 0.001% of CPU speed
+----------+
Spatial Locality
In this diagram, the closer data is to the CPU the faster the CPU can get at it.
This is related to Spatial Locality. Data has spatial locality if it is located close together in memory.
Because of the cheap-skates that we are, RAM is not really Random Access Memory; it is really "Slow if random, less slow if accessed sequentially" Access Memory (SIRLSIAS-AM). DDR SDRAM transfers a whole burst of 32 or 64 bytes for one read or write command.
That is why it is smart to keep related data close together, so you can do a sequential read of a bunch of data and save time.
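A classic way to see spatial locality in action (an illustrative sketch; the array size is an arbitrary choice) is to sum a 2-D array once row by row and once column by column. In row-major C the first walk is sequential, while the second strides a whole row between consecutive accesses, so it touches a new cache line (and often a new DRAM burst) every time:

// Row-major C array: a[i][j] and a[i][j+1] are adjacent in memory.
// Walking row by row is sequential (good spatial locality); walking column by
// column strides by a whole row per access (poor spatial locality).
#include <stdio.h>

#define ROWS 4096
#define COLS 4096

static int a[ROWS][COLS];                  // 64 MiB of ints

int main(void)
{
    // Touch every element first so the pages are really allocated.
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = i + j;

    long long sum = 0;

    // Fast: consecutive addresses; one 64-byte cache line serves 16 ints.
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];

    // Slow: each access is 16 KiB away from the previous one.
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];

    printf("%lld\n", sum);
    return 0;
}

Timing the two loop nests separately (e.g. with clock_gettime) typically shows the column-major walk running several times slower, which is exactly the sequential-vs-random effect described above.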
Temporal locality
Data stays in main-memory, but it cannot stay in the cache, or the cache would stop being useful. Only the most recently used data can be found in the cache; old data gets pushed out.
This is related to temporal locality. Data has strong temporal locality if accesses to it happen close together in time.
This is important because if item A is in the cache (good), then item B (with strong temporal locality to A) is very likely to also be in the cache.
Footnote 1:
This is a simplification with latency cycle counts estimated from various CPUs for example purposes, but they give you the right order-of-magnitude idea for typical CPUs.
In reality latency and bandwidth are separate factors, with latency harder to improve for memory farther from the CPU. But HW prefetching and/or out-of-order exec can hide latency in some cases, like looping over an array. With unpredictable access patterns, effective memory throughput can be much lower than 10% of L1d cache bandwidth.
For example, L2 cache bandwidth is not necessarily 3x worse than L1d bandwidth. (But it is lower if you're using AVX SIMD to do 2x 32-byte loads per clock cycle from L1d on a Haswell or Zen2 CPU.)
This simplified version also leaves out TLB effects (page-granularity locality) and DRAM-page locality. (Not the same thing as virtual memory pages). For a much deeper dive into memory hardware and tuning software for it, see What Every Programmer Should Know About Memory?
Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why a multi-level cache hierarchy is necessary to get the combination of latency/bandwidth and capacity (and hit-rate) we want.
One huge fast L1-data cache would be prohibitively power-expensive, and still couldn't match the low latency of the small, fast L1d cache in modern high-performance CPUs.
In multi-core CPUs, L1i/L1d and L2 cache are typically per-core private caches, with a shared L3 cache. Different cores have to compete with each other for L3 and memory bandwidth, but each has its own L1 and L2 bandwidth. See How can cache be that fast? for a benchmark result from a dual-core 3GHz IvyBridge CPU: aggregate L1d cache read bandwidth on both cores of 186 GB/s vs. 9.6 GB/s DRAM read bandwidth with both cores active. (So memory = 10% L1d for single-core is a good bandwidth estimate for desktop CPUs of that generation, with only 128-bit SIMD load/store data paths). And L1d latency of 1.4 ns vs. DRAM latency of 72 ns.
It is a principle which states that if some variables are referenced by a program, it is highly likely that the same variables will be referenced again later in time (this is known as temporal locality). It is also highly likely that locations adjacent in memory to those already referenced will be referenced soon (spatial locality).
First of all, note that these concepts are not universal laws, they are observations about common forms of code behavior that allow CPU designers to optimize their system to perform better over most of the programs. At the same time, these are properties that programmers seek to adopt in their programs as they know that's how memory systems are built and that's what CPU designers optimize for.
Spatial locality refers to the property of some (most, actually) applications to access memory in a sequential or strided manner. This usually stems from the fact that the most basic data structure building blocks are arrays and structs, both of which store multiple elements adjacently in memory. In fact, many implementations of data structures that are semantically linked (graphs, trees, skip lists) are using arrays internally to improve performance.
Spatial locality allows a CPU to improve the memory access performance thanks to:
Memory caching mechanisms work at granularities (cache lines, pages in the page tables, open pages in the memory controller) that are already larger by design than what is needed for a single access. This means that once you pay the memory penalty for bringing data in from far memory or a lower-level cache, the more additional data you can consume from it, the better your utilization.
Hardware prefetching, which exists on almost all CPUs today, often covers spatial accesses. Every time you fetch address X, the prefetcher will likely fetch the next cache line, and possibly others further ahead. If the program exhibits a constant stride, most CPUs would be able to detect that as well and extrapolate to prefetch even further steps of the same stride. Modern spatial prefetchers may even predict variable recurring strides (e.g. VLDP, SPP).
Temporal locality refers to the property of memory accesses or access patterns to repeat themselves. In the most basic form this could mean that if address X was once accessed it may also be accessed in the future, but since caches already store recent data for a certain duration this form is less interesting (although there are mechanisms on some CPUs aimed to predict which lines are likely to be accessed again soon and which are not).
A more interesting form of temporal locality is that two (or more) temporally adjacent accesses observed once, may repeat together again. That is - if you once accessed address A and soon after that address B, and at some later point the CPU detects another access to address A - it may predict that you will likely access B again soon, and proceed to prefetch it in advance.
Prefetchers aimed at extracting and predicting this type of relation (temporal prefetchers) often use relatively large storage to record many such relations. (See Markov prefetching, and more recently ISB, STMS, Domino, etc.)
By the way, these concepts are in no way exclusive, and a program can exhibit both types of localities (as well as other, more irregular forms). Sometimes both are even grouped together under the term spatio-temporal locality to represent the "common" forms of locality, or a combined form where the temporal correlation connects spatial constructs (like address delta always following another address delta).
Temporal locality of reference - A memory location that has been used recently is more likely to be accessed again. For e.g., Variables in a loop. Same set of variables (symbolic name for a memory locations) being used for some i number of iterations of a loop.
Spatial locality of reference - A memory location that is close to the currently accessed memory location is more likely to be accessed. For example, if you declare int a,b; float c,d; the compiler is likely to assign them consecutive memory locations. So if a is being used, it is very likely that b, c or d will be used in the near future. This is one way cache lines of 32 or 64 bytes help: they are not of size 4 or 8 bytes (the typical size of int, float, long and double variables).
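Both kinds of locality show up even in trivial code. In this small illustrative sketch, the loop counter and accumulator are reused on every iteration (temporal locality), while the array is written and then read in order (spatial locality):

// Temporal locality: `sum` and `i` are reused on every iteration, so they stay
// in registers or the L1 cache for the whole loop.
// Spatial locality: data[0], data[1], ... are adjacent, so one 64-byte cache
// line fetched for data[k] already holds the next several elements.
#include <stdio.h>

int main(void)
{
    int data[1024];

    for (int i = 0; i < 1024; i++)
        data[i] = i;                 // sequential writes: spatial locality

    long sum = 0;
    for (int i = 0; i < 1024; i++)   // i and sum reused: temporal locality
        sum += data[i];              // sequential reads: spatial locality

    printf("%ld\n", sum);
    return 0;
}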
