I've been reading a lot about operating systems and all lately. I understand how cache works and what is it used for.
However, when I asked a question to myself, I was unable to find an answer.
If a cache can be made as large as the device it is caching (for instance, a cache as large as a disk), why not do so and eliminate the device.
Assuming the cache is in-memory, it's not persistent, meaning you'll lose it once your machine is reboot. If nothing else, you'll need persistent storage (disk) to avoid losing data on restarts and power loses.
For a disk cache, there's no technical limit why a battery-backed RAM cache or flash storage usually used for HDD cache can't be combined together to provide terabytes of capacity. Your average user won't be using those RAM-based storage products because it's far more expensive than normal storage and they still consume too much power to leave unplugged for hours. We are leaving SSHD behind for full flash storage, but even they have RAM cache to reduce & consolidate IO calls.
For CPU cache, there's actually a practical limit to it, they're taking precious physical space that can be used for other unit, increasing distance (thus latency), and generate heat (which put a hard limit no matter how large is your cooling budget). That's why CPU cache is multilevel instead, with the small L1 tightly coupled to a core, the bigger L2 shared by multiple cores, and the slower but even bigger L3 with the slower but cheaper component. Even with an unlimited budget and an exotic process, a multilevel cache CPU will have better performance than a single-level cache CPU that's forced to put everything on (relatively) far silicon and suffering from the latency.
Related
I'm current writing a smaller project in OpenCL, and I'm trying to find out what really causes memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer this.
So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?
Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Similar to the SIMT execution model it is a architectural trade-off. It enables GPUs to have a more efficient and very high performance memory system but also forces programmers to think carefully about their data layout.
Without coalescing either the cache needs to be able to serve a huge number of requests at the same time or memory access would take a lot longer as the different data transfers would need to be handled one at a time. This is even relevant when just checking if something is a hit or a miss.
Merging requests is rather easy to do, you just pick one transfer and then merge all requests with matching upper address bits. You just generate a single request per cycle and replay the load or store instruction until all threads have been handled.
Caches also stores consecutive bytes, 32/64/128Byte, this fits most applications well, is a good fit to modern DRAM and reduces the overhead for cache bookeeping information: The cache is organized in cachelines and each cacheline has a tag that indicates which addresses are stored in the line.
Modern DRAM uses wide interfaces and also long bursts: The memory of a GPU is typically organized in 32-bit or 64-bit wide channels with GDDR5 memory that has a burst length of 8. This means that every transaction at the DRAM interface has to fetch at least 32-bit*8=32 byte or 64-bit*8=64 byte at a time, even if just a single byte is required from these bytes. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
GPUs also have a huge number of parallel threads active at the same time and rather small cache at the same time. CPUs are often able to use their caches to reorder their memory requests to DRAM friendly patterns. The larger number of threads and smaller caches on GPUs make this "cache based coalescing" less efficient on GPUs, as the data will often not stay long enough in the cache to get merged at the cache with other requests to the same cacheline.
Despite the "random access" name on "RAM" (Random-access Memory), Double-Data-Rate #3 Random-Access Memory (DDR3-RAM) is faster at accessing consecutive positions rather than randomly.
Case in point: "CAS Latency" is the amount of time that DDR3 RAM will stall when you're accessing a new "column", as your RAM chip is literally charging up to serve the new data from another location on the chip.
EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.
There's roughly a 10ns delay whenever you switch columns. So, if you have a bunch of memory accesses, if you keep access a bunch of data 'close' to each other, then you don't invoke a CAS delay.
So if you have 20-words to access at a particular location, its more efficient to access those 20-words before moving to a new memory location (invoking a CAS delay). Otherwise, you'll have to invoke ANOTHER CAS delay to "switch back" between memory locations.
Its just around 10 nanoseconds, but that amount of time adds up over time.
Can you please tell me some example code where we use ignorable amount of CPU and storage but heavy use of RAM? Like, if I run a loop and create objects, this will consume RAM but not CPU or storage. I mean tell me some memory expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while wasting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and makes it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who preempts what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift the Int64 type requires 64 bit of memory. So if you allocate space for 1000000 Int64 you will reserve memory for 8 MB.
UnsafeMutablePointer<Int64>.alloc(1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.
I'm talking about LRU memory page replacement algorithm implement in C, NOT in Java or C++.
According to the OS course notes:
OK, so how do we actually implement a LRU? Idea 1): mark everything we touch with a timestamp.
Whenever we need to evict a page, we select the oldest page (=least-recently used). It turns out that this
simple idea is not so good. Why? Because for every memory load, we would have to read contents of the
clock and perform a memory store! So it is clear that keeping timestamps would make the computer at
least twice as slow. I
Memory load and store operation should be very fast. Is it really necessary to get rid of these little tiny operations?
In the case of memory replacement, the overhead of loading page from disk should be a lot more significant than memory operations. Why would actually care about memory store and load?
If what the notes said isn't correct, then what is the real problem with implementing LRU with timestamp?
EDIT:
As I dig deeper, the reason I can think of is like the following. These memory store and load operations happen when there is a page hit. In this case, we are not loading page from disks, so the comparison is not valid.
Since the hit rate is expected to be very high, so updating the data structure associated with LRU should be very frequent. That's why we care about the operations repeated in the udpate process, e.g., memory load and store.
But still, I'm not convincing how significant the overhead is to do memory load and store. There should be some measurements around. Can someone point me to them? Thanks!
Memory load and store operations can be quite fast, but in most real life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.
Rough numbers for memory access times:
L1 cache hit: 2-4 CPU cycles
L2 cache hit: 10-20 CPU cycles
L3 cache hit: 50 CPU cycles
Main memory access: 100-200 CPU cycles
So it costs real time to do loads and stores. With LRU, every regular memory access will also incur the cost of a memory store operation. This alone doubles the number of memory accesses the CPU does. In most situations this will slow the program execution. In addition, on a page eviction all the timestamps will need to be read. This will be quite slow.
In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.
In short - LRU is quite expensive.
Since cache inside the processor increases the instruction execution speed. I'm wondering what if we increase the size of cache to many MBs like 1 GB. Is it possible? If it is will increasing the cache size always result in increased performance?
There is a tradeoff between cache size and hit rate on one side and read latency with power consumption on another. So the answer to your first question is: technically (probably) possible, but unlikely to make sense, since L3 cache in modern CPUs with size of just a few MBs has read latency of about dozens of cycles.
Performance depends more on memory access pattern than on cache size. More precisely, if the program is mainly sequential, cache size is not a big deal. If there are quite a lot of random access (ex. when associative containers are actively used), cache size really matters.
The above is true for single computational tasks. In multiprocess environment with several active processes bigger cache size is always better, because of decrease of interprocess contention.
This is a simplification, but, one of the primary reasons the cache increases 'speed' is that it provides a fast memory very close to the processor - this is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory, and thereby improve performance.. In the real world things are obviously much more complex than this, and there will of course be added complexity, and cost, associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms etc.
As cache stores data temporary. Cache is used to locate the file easily that has been frequently using. So if the size of cache increased upto 1gb or more it will not stay as cache, it becomes RAM. Data is stored in ram temporary. So if cache isn't used, when data is called by processor, ram will take time to fetch data to provide to the processor because of its wide size of 4gb or more. So we use cache as our temporary memory for the things we recently or frequently used. In this way, ram ram doesnt required to find and fetch data to give it to processor, because processor direct access data from cache, because of small size of cache, it doesnt take time to find data, and processor doesn't require to call ram to fetch data, all of this done fastly without ram. Lets take an example, we have a wide classroom (RAM) , our principal (processor) call class CR (Data) for some purposes, then ones will go to the class room and will find the CR in the class of 1000 students and take him to the principal. It takes time. When we specify a space(cache) for CR in the class, because principal mostly call CR of the class, so it will become easy to find CR becuase most of the time CR is called by Principal.
It's a frequently asked question, I know. But I tried every solution suggested (apc.stat=0, increasing shared memory, etc) without benefits.
Here's the screen with stats (you can see nginx and php5-fpm) and the parameters set in apc.ini:
APC is used for system and user-cache entries on multiple sites (8-9 WordPress sites and one with MediaWiki and SMF).
What would you suggest?
Each wordpress site is is going to cache a healthy amount in the user cache. I've looked at this myself in depth and have found the best 'guesstimate' is that if you use user cache in APC keep the fragmentation under 10%. This can sometimes mean you need to try and reserve upwards of 10x the amount of memory you intend to actually use for caching to avoid fragmentation. Start where you are and keep increasing memory allocated until fragmentation stays below 10% after running a while.
BTW: wordpress pages being cached are huge, so you'll probably need a lot of memory to avoid fragmentation.
Why 10% fragmentation? It's a bit of a black art, but I observed this is where performance start to noticeabily decline. I haven't found any good benchmarks (or run my own controlled tests) however.
This 10x amount to get this my seem insane, but root cause is APC has no way to defragment other than a restart (which completely dumps the cache). Having a slab of 1G of memory when you only plan to use 100-200m gives a lot of space to fill without having to look for 'holes' to put stuff. Think about bad old FAT16 or FAT32 disk performance with Windows 98--constant need to manually defrag as the disk gets over 50% full.
If you can't afford the extra memory to spare, you might want to look at either memcached or plain old file caching for your user cache.