Is coalesced memory access a feature or phenomenon? - caching

I'm currently writing a small project in OpenCL, and I'm trying to find out what really causes memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer this.
So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?

Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Similar to the SIMT execution model, it is an architectural trade-off. It enables GPUs to have a more efficient and very high performance memory system, but it also forces programmers to think carefully about their data layout.
Without coalescing, either the cache would need to serve a huge number of requests at the same time, or memory access would take a lot longer because the different data transfers would have to be handled one at a time. This is relevant even when just checking whether something is a hit or a miss.
Merging requests is rather easy to do: you pick one transfer and then merge all requests whose upper address bits match. You generate a single request per cycle and replay the load or store instruction until all threads have been handled.
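As a rough host-side sketch of that merging rule (plain C++; the 128-byte transaction size, the warp size of 32 and the transactionsForWarp name are just assumptions for illustration, not any vendor's actual hardware logic):

#include <cstdint>
#include <cstdio>
#include <set>

// Count how many 128-byte transactions a warp of 32 threads needs:
// every request whose upper address bits (address / 128) match is merged
// into the same transaction.
int transactionsForWarp(const uint64_t addr[32]) {
    std::set<uint64_t> lines;              // distinct "upper bits" seen so far
    for (int t = 0; t < 32; ++t)
        lines.insert(addr[t] / 128);       // drop the low 7 bits
    return static_cast<int>(lines.size()); // one transaction per distinct line
}

int main() {
    uint64_t coalesced[32], strided[32];
    for (int t = 0; t < 32; ++t) {
        coalesced[t] = 0x1000 + 4 * t;     // consecutive 4-byte words -> 1 transaction
        strided[t]   = 0x1000 + 256 * t;   // 256-byte stride -> 32 transactions
    }
    std::printf("coalesced: %d transaction(s)\n", transactionsForWarp(coalesced));
    std::printf("strided:   %d transaction(s)\n", transactionsForWarp(strided));
}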
Caches also store consecutive bytes, 32/64/128 bytes per line. This fits most applications well, is a good fit for modern DRAM, and reduces the overhead of cache bookkeeping information: the cache is organized in cache lines, and each cache line has a tag that indicates which addresses are stored in the line.
Modern DRAM uses wide interfaces and long bursts: the memory of a GPU is typically organized in 32-bit or 64-bit wide channels, with GDDR5 memory that has a burst length of 8. This means that every transaction at the DRAM interface has to fetch at least 32 bits × 8 = 32 bytes or 64 bits × 8 = 64 bytes at a time, even if only a single byte of these is required. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
GPUs also have a huge number of threads active at the same time and rather small caches. CPUs are often able to use their caches to reorder their memory requests into DRAM-friendly patterns. The larger number of threads and smaller caches on GPUs make this "cache-based coalescing" less effective, as the data will often not stay in the cache long enough to be merged there with other requests to the same cache line.
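To make the data-layout point concrete, here is a minimal CUDA sketch (the array size, block size and stride of 32 floats are arbitrary choices): in the first kernel the 32 threads of a warp read consecutive words, which coalesce into one wide transaction, while in the second each thread lands in a different 128-byte line and therefore a different DRAM burst.

#include <cuda_runtime.h>

// Each thread reads the element right next to its neighbours' elements:
// a warp covers one contiguous 128-byte span -> coalesced.
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Each thread reads an element 32 floats (128 bytes) away from its neighbour's:
// the 32 threads of a warp touch 32 different lines -> up to 32 transactions per warp.
__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride % n;      // scatter the accesses
    if (i < n) out[i] = in[j];
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    coalescedCopy<<<n / 256, 256>>>(in, out, n);
    stridedCopy<<<n / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
}

Profiling both kernels (e.g. looking at the number of global memory transactions per request) is the usual way to see the difference; the strided version does the same amount of useful work but moves far more bytes over the DRAM interface.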

Despite the "random access" in the name RAM (Random-Access Memory), Double-Data-Rate 3 Random-Access Memory (DDR3-RAM) is faster at accessing consecutive positions than random ones.
Case in point: "CAS Latency" is the amount of time that DDR3 RAM will stall when you're accessing a new "column", as your RAM chip is literally charging up to serve the new data from another location on the chip.
EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.
There's roughly a 10 ns delay whenever you switch columns. So if you keep accessing data that is 'close' together, you don't invoke a CAS delay.
So if you have 20 words to access at a particular location, it's more efficient to access all 20 of them before moving to a new memory location (and invoking a CAS delay). Otherwise, you'll have to invoke ANOTHER CAS delay to "switch back" between memory locations.
It's only around 10 nanoseconds, but it adds up quickly.
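A rough way to observe this from software is to touch the same amount of data once sequentially and once with a large stride (plain C++; the array size and the 4096-element stride are arbitrary, and the measured gap reflects the whole cache and DRAM configuration rather than the 10 ns figure alone):

#include <chrono>
#include <cstdio>
#include <vector>

// Touch the same total number of elements, once sequentially and once with a
// large stride, so the second version keeps jumping to new lines/rows.
int main() {
    const size_t N = 1 << 26;                  // 64M ints, ~256 MB: far bigger than any cache
    std::vector<int> a(N, 1);
    const size_t stride = 4096;                // jump far enough to leave the current line/row

    auto time = [&](size_t step) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t start = 0; start < step; ++start)
            for (size_t i = start; i < N; i += step)
                sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("step %zu: %lld ms (sum=%lld)\n", step,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
                    sum);
    };

    time(1);        // sequential: consecutive addresses, bursts and row buffer reused
    time(stride);   // strided: each access likely lands in a fresh line/row
}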

Related

Operations that use only RAM

Can you please give me some example code that uses a negligible amount of CPU and storage but makes heavy use of RAM? For example, if I run a loop and create objects, this will consume RAM but not CPU or storage. I mean, tell me some memory-expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like, it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade-off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while waiting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and makes it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who preempts what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
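A minimal sketch of that idea (C++; the node count and the use of a shuffled order are just one way to get scattered nodes): the nodes are linked in a random order, so the prefetcher has nothing predictable to work with and almost every hop is a cache miss served from RAM.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

struct Node { Node* next; long long payload; };

int main() {
    const size_t count = 1 << 24;              // 16M nodes, well beyond any CPU cache
    std::vector<Node> pool(count);

    // Link the nodes in a random order so that following `next` jumps
    // all over the allocation instead of walking it sequentially.
    std::vector<size_t> order(count);
    for (size_t i = 0; i < count; ++i) order[i] = i;
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i + 1 < count; ++i) pool[order[i]].next = &pool[order[i + 1]];
    pool[order[count - 1]].next = nullptr;

    // Traverse: each dereference is an unpredictable address -> cache miss -> RAM access.
    long long sum = 0;
    for (Node* p = &pool[order[0]]; p != nullptr; p = p->next) sum += p->payload;
    std::printf("sum = %lld\n", sum);
}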
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift the Int64 type requires 64 bits of memory, so if you allocate space for 1,000,000 Int64 values you reserve 8 MB of memory.
UnsafeMutablePointer<Int64>.alloc(1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.

Does larger cache size always lead to improved performance?

Since cache inside the processor increases instruction execution speed, I'm wondering what would happen if we increased the size of the cache to many MBs, or even 1 GB. Is that possible? If it is, will increasing the cache size always result in increased performance?
There is a trade-off between cache size and hit rate on one side, and read latency and power consumption on the other. So the answer to your first question is: technically (probably) possible, but unlikely to make sense, since the L3 cache in modern CPUs, with a size of just a few MB, already has a read latency of dozens of cycles.
Performance depends more on the memory access pattern than on cache size. More precisely, if the program's accesses are mainly sequential, cache size is not a big deal. If there is a lot of random access (e.g. when associative containers are heavily used), cache size really matters.
The above is true for a single computational task. In a multi-process environment with several active processes, a bigger cache is always better, because it reduces inter-process contention.
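A sketch of what "cache size really matters" means in practice (C++; the working-set sizes and the randomReads name are placeholders): the same number of random reads is performed over a small working set and a large one, and only the large one is expected to fall out of the last-level cache.

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Do the same number of random reads over working sets of different sizes.
// While the working set fits in the last-level cache the random pattern is
// cheap; once it doesn't, most reads go to DRAM.
static long long randomReads(size_t workingSetBytes, size_t reads) {
    std::vector<int> data(workingSetBytes / sizeof(int), 1);
    std::mt19937_64 rng{123};
    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < reads; ++i)
        sum += data[rng() % data.size()];
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%8zu KiB working set: %lld ms\n", workingSetBytes / 1024,
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    return sum;
}

int main() {
    const size_t reads = 1 << 26;
    randomReads(1 << 20, reads);   // 1 MiB: fits in a typical L2/L3
    randomReads(1 << 28, reads);   // 256 MiB: much larger than any L3
}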
This is a simplification, but one of the primary reasons the cache increases 'speed' is that it provides fast memory very close to the processor - this is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory, and thereby improve performance. In the real world things are obviously much more complex than this, and there will of course be added complexity and cost associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms etc.
Cache stores data temporarily; it is used so that frequently used data can be located quickly. If the cache were increased to 1 GB or more, it would no longer behave like a cache - it would effectively become another RAM. Data is also stored in RAM temporarily, but if there were no cache, then every time the processor asked for data, RAM would take time to find and fetch it because of its large size (4 GB or more). So we use the cache as fast temporary storage for the things we used recently or frequently. That way RAM is not needed to find and fetch the data: the processor accesses it directly from the cache, and because the cache is small it takes little time to locate the data, all without involving RAM. As an analogy: we have a large classroom (RAM), and the principal (processor) often calls for the class representative (data). If someone has to find the CR among 1000 students in the classroom, it takes time. If we set aside a specific spot (cache) for the CR, because the principal calls for the CR most of the time, it becomes easy to find the CR there.

If we have infinite memory, then do we still be needing paging?

Paging creates the illusion that each process has unlimited RAM by moving pages to and from disk. So if we had infinite memory (in some hypothetical situation), would we still need paging? If yes, then why? I was asked this question in an interview.
Assuming that "infinite memory" means infinite randomly accessible memory, or RAM, we would still need paging. Although paging is often associated with the ability to swap pages between RAM and a hard disk to conserve memory, this is merely one aspect of paging. Here are some other reasons to have paging:
Security. Paging is a method to enforce operating system security and memory protection by ensuring that a process cannot access the memory of another process and cannot modify the resident kernel.
Multitasking. Paging aids multitasking by virtualizing the memory space; that is, address 0xFOO in process A can map to something completely different from 0xFOO in process B (a toy translation sketch follows this answer).
Memory allocation. Paging aids memory allocation by reducing fragmentation and by ensuring RAM is only allocated when it is actually accessed. Although a process may need, say, 100 MB of contiguous address space, this need not be contiguous physically. Additionally, when a program requests 100 MB of space, the operating system will tell the program it is safe to use that 100 MB, yet nothing is actually allocated until the program touches that space.
Admittedly, the latter would not be strictly necessary if one had infinite RAM; nonetheless, it is always good practice to be efficient even when we are not resource constrained. It also demonstrates a use for paging that is sometimes not considered.
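To make the address-virtualization point concrete, here is the toy translation sketch mentioned above (C++; the 4 KiB page size is realistic, but the single-level hash-map "page table" and the translate name are simplifications invented for illustration - real MMUs use multi-level tables walked in hardware):

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Toy single-level page table: virtual page number -> physical frame number.
// With 4 KiB pages the low 12 bits are the offset and pass through unchanged.
constexpr uint64_t PAGE_SIZE = 4096;

uint64_t translate(const std::unordered_map<uint64_t, uint64_t>& pageTable,
                   uint64_t virtualAddr) {
    uint64_t vpn    = virtualAddr / PAGE_SIZE;   // virtual page number
    uint64_t offset = virtualAddr % PAGE_SIZE;   // stays the same
    auto it = pageTable.find(vpn);
    if (it == pageTable.end()) {
        std::printf("page fault at %#llx\n", (unsigned long long)virtualAddr);
        return 0;                                // a real OS would map a frame or kill here
    }
    return it->second * PAGE_SIZE + offset;      // physical frame base + offset
}

int main() {
    // Two processes can both use the same virtual page, backed by different frames.
    std::unordered_map<uint64_t, uint64_t> processA = {{0x40001, 0x00123}};
    std::unordered_map<uint64_t, uint64_t> processB = {{0x40001, 0x07777}};
    std::printf("A: %#llx\n", (unsigned long long)translate(processA, 0x40001234));
    std::printf("B: %#llx\n", (unsigned long long)translate(processB, 0x40001234));
}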
This is a philosophical question, so here's a philosophical answer :)
The trick in this question is that you make assumptions about the infinite memory. It's OK to say "no, no need to use paging, BUT", and follow with:
The infinite memory has to be accessible within an acceptable time limit for memory access. If it isn't (because infinity takes a lot of space, and the memory sits farther away from the processing unit), there's no difference between it and the disk; both fail the readily-available-memory requirement, which is what caching via pages tries to solve.
Take for example Amazon's S3, which for all practical purposes is infinite. If you can rely on S3 to satisfy all your memory requirements in the sense that when you need something fetched within time x you can fetch it from S3, there's no need to page anything or even hold it in "local" memory. Just get it from S3 whenever you need it, as many times as you need it. (Obviously this would have other repercussions like cost and network, but let's ignore that for now).
Of course you can always say that optimally you want memory access to be as fast as possible, and "fast enough" is probably slower than "fastest", so local memory access would give you better performance etc.
And last, if I had to envision a memory which is infinite and has the same access time no matter how "far" the memory unit is from the fetching unit, I would have to envision a sphere where the processing unit is in the middle, so that you can't argue that one memory unit is slower than the other because of the distance. Otherwise you could say that paging would be done internally within the memory so that access is faster to the memory units that are most used (or whatever algorithm you choose to use).

CUDA: When to use shared memory and when to rely on L1 caching?

Since Compute Capability 2.0 (Fermi) was released, I've been wondering whether there are any use cases left for shared memory. That is, when is it better to use shared memory than to just let L1 perform its magic in the background?
Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?
To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1) and synchronize with __threadfence_block()? The latter option should be easier to implement, since it doesn't have to keep track of two different locations for the values, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for data to actually make it all the way out to global memory.
With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?
Two big reasons why automatic caching is less efficient than manually managed scratchpad memory (this applies to CPUs as well):
Parallel accesses to random addresses are more efficient. Example: histogramming. Say you want to increment N bins, each more than 256 bytes apart. Then, due to coalescing rules, that will result in N serial reads/writes, since global memory and cache memory are organized in large ~256-byte blocks. Shared memory doesn't have that problem (a small shared-memory histogram sketch follows this answer).
Also, to access global memory you have to do virtual-to-physical address translation. Having a TLB that can do lots of translations in parallel would be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in parallel, and I believe this is the reason why.
It avoids writing back dead values to memory, which wastes bandwidth and power. Example: in an image-processing pipeline, you don't want your intermediate images to get flushed to memory.
Also, according to an NVIDIA employee, current L1 caches are write-through (immediately writes to L2 cache), which will slow down your program.
So basically, the caches get in the way if you really want performance.
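Here is the histogram case from point 1 as a minimal CUDA sketch (256 bins, the block count and the block size are arbitrary choices): each block accumulates into shared memory, where the scattered atomics stay on chip, and only the merged per-block result touches global memory once per bin.

#include <cuda_runtime.h>

// Per-block histogram in shared memory: the scattered increments hit the
// on-chip scratchpad, and only 256 atomics per block go out to DRAM.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          // random bin, but in shared memory
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);           // one merged write-out per bin per block
}

int main() {
    const int n = 1 << 24;
    unsigned char* data;  unsigned int* bins;
    cudaMalloc((void**)&data, n);
    cudaMalloc((void**)&bins, 256 * sizeof(unsigned int));
    cudaMemset(data, 7, n);                      // dummy input: every byte is 7
    cudaMemset(bins, 0, 256 * sizeof(unsigned int));
    histogram256<<<128, 256>>>(data, n, bins);
    cudaDeviceSynchronize();
    cudaFree(data); cudaFree(bins);
}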
As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU, so your comment that "this is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.
Data in the L1 cache isn't evicted when it isn't used often enough. It is usually evicted when a request is made for a memory region that wasn't previously in the cache and whose address resolves to a line that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory entry can only be cached in a small subset of the entire cache, based on its address.
I suppose this may also answer your question. With shared memory you get full control over what gets stored where, while with the cache everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way, since you're the one who knows what input will be given and what the threads will do (to a certain extent, of course).
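To make "a small subset of the entire cache, based on its address" concrete, here is a toy index/tag split (C++; the 32 KiB, 8-way, 128-byte-line parameters are just an example configuration, not necessarily what any specific GPU uses):

#include <cstdint>
#include <cstdio>

// Example: 32 KiB cache, 8-way set associative, 128-byte lines.
// 32 KiB / 128 B = 256 lines, grouped into 256 / 8 = 32 sets.
// A given address can only ever live in the one set selected by its bits.
constexpr uint64_t LINE_SIZE = 128;
constexpr uint64_t NUM_SETS  = 32;

void whereCanItGo(uint64_t addr) {
    uint64_t offset = addr % LINE_SIZE;                // byte within the line
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;   // which set (8 candidate slots)
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);   // identifies the line within its set
    std::printf("addr %#10llx -> set %2llu, tag %#llx, offset %llu\n",
                (unsigned long long)addr, (unsigned long long)set,
                (unsigned long long)tag, (unsigned long long)offset);
}

int main() {
    whereCanItGo(0x1000);       // these two addresses are 4 KiB apart...
    whereCanItGo(0x2000);       // ...and map to the same set: they compete for 8 ways
    whereCanItGo(0x1080);       // the next line over lands in a different set
}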
Caching data through several memory layers always needs to follow a cache-coherency protocol. There are several such protocols and the decision on which one is the most suitable is always a trade off.
You can have a look at some examples:
Related to GPUs
Generally for computing units
I don't want to go into many details, because it is a huge domain and I am not an expert. What I want to point out is that in a shared-memory system (here the term shared does not refer to the so-called shared memory of GPUs), where many compute units (CUs) need data concurrently, there is a memory protocol that attempts to keep the data close to the units so that they can fetch it as fast as possible. In the example of a GPU, when many threads in the same SM (streaming multiprocessor) access the same data, there should be coherency in the sense that if thread 1 reads a chunk of bytes from global memory and in the next cycle thread 2 is going to access that data, an efficient implementation would be one where thread 2 is aware that the data is already in the L1 cache and can access it quickly. This is what a cache coherency protocol attempts to achieve: to keep all compute units up to date about what data exists in the L1 and L2 caches and so on.
However, keeping threads up to date, or in other words keeping them in coherent states, comes at some cost, which is essentially lost cycles.
In CUDA, by declaring memory as shared rather than relying on the L1 cache, you free it from that coherency protocol. So access to that memory (which is physically the same piece of hardware) is direct and does not implicitly invoke the coherency machinery.
I don't know exactly how much faster this is, and I didn't run any benchmark, but the idea is that since you no longer pay for this protocol, accesses should be faster!
Of course, the shared memory on NVIDIA GPUs is split into banks, and anyone who wants to use it for a performance improvement should look into this first. The reason is bank conflicts, which occur when two threads access the same bank and cause the accesses to be serialized... but that's another topic.
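As a quick illustration of those bank conflicts (CUDA; 32 banks of 4-byte words is the usual organization on current NVIDIA parts, but check the documentation for a specific architecture): a per-thread row walk through a 32x32 shared array keeps every thread of the warp in the same bank at each step, and padding each row by one element removes that.

#include <cuda_runtime.h>

// Each thread t sums row t of a 32x32 tile held in shared memory.
// At every step of the inner loop all 32 threads read tile[t][c] with the same
// c: without padding those words all fall in bank c -> 32-way conflict.
__global__ void rowSumConflicting(const float* in, float* out) {
    __shared__ float tile[32][32];               // row stride of 32 words: one bank per column
    int t = threadIdx.x;
    for (int r = 0; r < 32; ++r)
        tile[r][t] = in[r * 32 + t];             // conflict-free, coalesced fill
    __syncthreads();
    float sum = 0.f;
    for (int c = 0; c < 32; ++c)
        sum += tile[t][c];                       // whole warp in bank c: serialized
    out[t] = sum;
}

__global__ void rowSumPadded(const float* in, float* out) {
    __shared__ float tile[32][33];               // +1 padding skews rows across banks
    int t = threadIdx.x;
    for (int r = 0; r < 32; ++r)
        tile[r][t] = in[r * 32 + t];
    __syncthreads();
    float sum = 0.f;
    for (int c = 0; c < 32; ++c)
        sum += tile[t][c];                       // banks (t + c) % 32: all different
    out[t] = sum;
}

int main() {
    float *in, *out;
    cudaMalloc((void**)&in,  32 * 32 * sizeof(float));
    cudaMalloc((void**)&out, 32 * sizeof(float));
    rowSumConflicting<<<1, 32>>>(in, out);
    rowSumPadded<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
}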

Why do we bother with CPU registers in assembly, instead of just working directly with memory?

I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 0000040B
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?
If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but on the order of 1 KB per core - there are different types of registers; you might have 16 64-bit general-purpose registers plus a bunch of special-purpose registers)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency at each step away from the CPU. For example, an HDD might be readable at 100 MB/s with a latency of 5 ms (these numbers may not be exactly correct), while your main memory can be read at 6.4 GB/s with a latency of 9 ns (roughly six orders of magnitude less!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
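A small way to see the value of register reuse from C++ (the volatile qualifier stands in for "go through memory every time"; the loop body, iteration count and exact numbers are arbitrary, and the reloads are actually served by the store buffer and L1 cache rather than RAM, so this understates the gap to main memory):

#include <chrono>
#include <cstdio>

// The same dependent update chain twice: once the accumulator is an ordinary
// local variable that the compiler can keep in a register for the whole loop,
// and once `volatile` forces a store and a reload through memory every iteration.
int main() {
    const long long iters = 1LL << 28;

    unsigned long long reg = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i)
        reg = reg * 3 + (unsigned long long)i;       // free to stay in a register
    auto t1 = std::chrono::steady_clock::now();

    volatile unsigned long long mem = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i)
        mem = mem * 3 + (unsigned long long)i;       // load + store every time
    auto t3 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("register accumulator: %lld ms (value %llu)\n", ms(t0, t1), reg);
    std::printf("memory accumulator:   %lld ms (value %llu)\n", ms(t2, t3), (unsigned long long)mem);
}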
Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU, so signals can travel almost instantly. They're also the fastest type of memory, but they take significant space, so we can only have a limited number of them. Increasing the number of registers increases:
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data will be placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays they're usually on the same die as the CPU. Caches are constructed from SRAM cells, which are smaller than register cells but may be tens or hundreds of times slower.
Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However, some embedded systems do use the register file as main memory.
More information: Can we have a computer with just registers as memory?
Registers are much faster, and the operations that you can perform directly on memory are far more limited.
In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM of which the first 64 are exposed as 32 16-bit registers while also being accessible as addressable RAM. Or, another example: the MOS Technology 6502's "zero page" (the RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, owing to the small number of real registers in the CPU. But this scales poorly to larger setups.
The advantages of registers are the following:
They are the fastest. They are faster than any cache in a typical modern system, and even more so than DRAM. (In the example above, the RAM is likely SRAM, but SRAM of a few gigabytes is unaffordably expensive.) And they are close to the processor. The difference in access time between a register and DRAM can reach factors like 200 or even 1000. Even compared to the L1 cache, register access is typically 2-4 times faster.
Their number is limited. A typical instruction set would become too bloated if every memory location could be addressed explicitly in an instruction.
Registers are specific to each CPU (core, hardware thread, hart). (In systems where fixed RAM addresses serve the role of special registers, as e.g. zSeries does, this needs a special remapping of that service area in absolute addresses, separate for each core.)
In the same manner as (3), registers are specific to each thread, without any need to adjust locations in the code per thread.
Registers (relatively easily) allow specific optimizations, such as register renaming. This would be far too complex if memory addresses were used.
Additionally, there are registers that could not be implemented in a separate block of RAM, because accessing RAM itself requires them to change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction fetch phase", "instruction decode phase", "ALU phase", "data write phase" and so on, plus the equivalents of this register in more complicated (pipelined, out-of-order) designs, and also various buffer registers for bus accesses, and so on. But such registers are not visible to the programmer, so you probably did not mean them.
x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
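A sketch of the multiple-accumulator idea referred to above (plain C++ reduction; whether the compiler really keeps the four partial sums in four registers is up to it, but that is the intent):

#include <cstdio>
#include <vector>

// One accumulator: every addition waits for the previous one - a serial
// dependency chain through a single register.
double sumSerial(const std::vector<double>& v) {
    double s = 0.0;
    for (size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Four accumulators: four independent chains can be in flight at once,
// hiding the latency of each floating-point add. The four separate
// architectural registers plus register renaming make this possible.
double sumUnrolled(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];     // leftover elements
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<double> v(1 << 24, 1.0);
    std::printf("serial:   %.1f\n", sumSerial(v));
    std::printf("unrolled: %.1f\n", sumUnrolled(v));
}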
The store buffer with store forwarding basically gives you "memory renaming". A store/reload to a memory location is independent of earlier stores and loads to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
Registers are accessed way faster than RAM, since you don't have to go through the "slow" memory bus!
We use registers because they are fast. Usually, they operate at CPU's speed.
Registers and CPU cache are made with different technology and fabrication processes, and they are expensive. RAM, on the other hand, is cheap and 100 times slower.
Generally speaking, register arithmetic is much faster and much preferred. However, there are some cases where direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also, if you are doing complex array operations, you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.
Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.
Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc
It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.

Resources