What happens to dirty cache lines that have not yet been written back (assuming a write-back cache) when the page they belong to is chosen for eviction by the operating system? In other words, what happens to a page's cache lines when that page is chosen for eviction from main memory?
The general assumption is that by the time the page is evicted from memory it is cold enough not to be cached. However, could this pose a problem with larger caches? What if, say, we have enough cache lines to fit one line from each page?
Nothing will directly happen to the cache lines. In an operating system we must always assume the worst of the possibilities, and even small caches could face your proposed scenario. However, the operating system does not need to do anything directly to the cache lines; the normal process of paging handles this scenario.
When the operating system decides to evict a page, it will first write its contents out to disk (the processor will have marked the page table entry as dirty, indicating that the page has been written to). In the process of doing so, the OS reads the values of the entire page, including the dirty cache lines, so that data is written back, at least to disk.
Next, as part of evicting the page, the operating system invalidates the translations that map the virtual addresses to the physical addresses. Even if the cache lines are still dirty, they can no longer be accessed by the process (i.e., the program). After this step, the operating system will write different data into the physical page, which overwrites and supersedes the stale cache lines. Finally, the newly repurposed physical page will be mapped to a virtual address.
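To make that sequence concrete, here is a toy, user-space simulation of the steps above. The struct and helper functions are hypothetical stand-ins made up for illustration, not a real kernel API; they only make the order of operations explicit.

/* Toy simulation of the eviction steps described above.
 * The "page" and the helpers are hypothetical stand-ins. */
#include <stdbool.h>
#include <stdio.h>

struct page { int id; bool pte_dirty; };

static void write_page_to_disk(struct page *p) { printf("write page %d to disk\n", p->id); }
static void unmap_translation(struct page *p)  { printf("invalidate translation for page %d (and flush TLB entry)\n", p->id); }
static void reuse_frame(struct page *p)        { printf("reuse frame of page %d for new data\n", p->id); }

static void evict_page(struct page *victim)
{
    /* 1. Dirty page table entry => the page was written to; write it
     *    out first.  The memory read done here sees any still-dirty
     *    cache lines, so the data reaching disk is up to date. */
    if (victim->pte_dirty)
        write_page_to_disk(victim);

    /* 2. Remove the virtual-to-physical translation so the owning
     *    process can no longer reach the page or its cache lines. */
    unmap_translation(victim);

    /* 3. The frame is now free; writing new data into it supersedes
     *    any stale cache lines that are still hanging around. */
    reuse_frame(victim);
}

int main(void)
{
    struct page p = { 42, true };
    evict_page(&p);
    return 0;
}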
To ask the question another way: can you confirm that when you mmap() a file, you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found that performance got terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead did a single mmap that was passed to the threads (so that all threads directly access the same memory region), performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
My guess is that it doesn't actually create them all up front, which is why mmap() is instant to call even for a giant file. Instead it probably establishes those entries as faults occur on the mmapped range, so the page directory gets filled out over time. This extra work to populate the page directory is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range -- which is why munmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.
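For what it's worth, here is a minimal sketch of the faster pattern: one mmap created up front and shared by all worker threads, so the page-table entries for the region get populated once rather than once per private mapping. The file name, thread count, and per-thread work are placeholders for illustration only.

/* One shared mapping, many threads.  Build with: cc -pthread file.c */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 8                       /* placeholder thread count */

struct task {
    const unsigned char *base;           /* shared mapping, set up once     */
    size_t               off;            /* this thread's slice of the file */
    size_t               len;
    unsigned long        sum;            /* dummy result to keep the reads  */
};

static void *worker(void *arg)
{
    struct task *t = arg;
    for (size_t i = 0; i < t->len; i++)
        t->sum += t->base[t->off + i];   /* faults fill page-table entries once */
    return NULL;
}

int main(void)
{
    int fd = open("data.bin", O_RDONLY);         /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* One mapping shared by every thread. */
    unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    size_t chunk = st.st_size / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        tasks[i] = (struct task){ base, (size_t)i * chunk, chunk, 0 };
        pthread_create(&tid[i], NULL, worker, &tasks[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}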
From what I understand, demand paging is basically paging with swapping, so you can swap in a page when it is needed. But page replacement seems like more or less the same thing: you bring in a page when it is needed and swap it with an existing page in physical memory.
So is there a distinct difference?
In a system that uses demand paging, the operating system copies a disk page into physical memory only if an attempt is made to access it and that page is not already in memory (i.e., if a page fault occurs). It follows that a process begins execution with none of its pages in physical memory, and many page faults will occur until most of a process's working set of pages is located in physical memory. This is an example of a lazy loading technique.
From Wikipedia's Demand paging:
Demand paging follows that pages should only be brought into memory if the executing process demands them. This is often referred to as lazy evaluation as only those pages demanded by the process are swapped from secondary storage to main memory. Contrast this to pure swapping, where all memory for a process is swapped from secondary storage to main memory during the process startup.
Page replacement, on the other hand, is simply the technique applied when a page fault occurs and no free frame is available. Page replacement is used with both pure swapping and demand paging.
Page replacement simply means swapping a page out of memory to disk so that another page can be brought into its frame.
Demand paging is a concept in which only the required pages are brought into memory. When a required page is not in memory, the system looks for a free frame. If there are no free frames, a page replacement is done to bring the required page from disk into memory.
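To show where the two ideas meet, here is a hedged sketch of a page-fault handler as a toy, user-space simulation. All types and helper names are hypothetical stand-ins, not a real kernel API: demand paging is the decision to load a page only when a fault happens, and page replacement is the extra step taken when no free frame exists.

/* Toy sketch: demand paging vs. page replacement inside a fault handler. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct frame { int id; bool dirty; };

static struct frame *find_free_frame(void) { return NULL; }  /* pretend memory is full */
static struct frame *choose_victim(void)   { static struct frame f = { 7, true }; return &f; } /* e.g. LRU / clock */
static void write_frame_to_disk(struct frame *f) { printf("write back dirty frame %d\n", f->id); }
static void unmap_victim(struct frame *f)        { printf("unmap victim frame %d\n", f->id); }
static void read_page_from_disk(int vpage, struct frame *f) { printf("load virtual page %d into frame %d\n", vpage, f->id); }
static void map_page(int vpage, struct frame *f) { printf("map virtual page %d -> frame %d\n", vpage, f->id); }

/* Demand paging: this runs only because the process touched the page. */
static void handle_page_fault(int vpage)
{
    struct frame *f = find_free_frame();

    if (f == NULL) {
        /* Page replacement: no free frame, so evict a victim first. */
        f = choose_victim();
        if (f->dirty)
            write_frame_to_disk(f);
        unmap_victim(f);
    }

    read_page_from_disk(vpage, f);   /* bring in the demanded page */
    map_page(vpage, f);              /* update the page table      */
}

int main(void)
{
    handle_page_fault(3);
    return 0;
}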
When a page is swapped out to disk, some of its content might be present in a cache (I believe this would be a very rare scenario, because if the page has not been accessed for a long time, the cache lines containing its content would most likely have been evicted by then). What happens to these cache lines when the page is swapped out? Do they need to be invalidated immediately? Does it make any difference whether a line is dirty or clean? Who controls this process: the OS, the hardware, or both?
To explain why this needs to be taken care of, let's assume there are processes A and B, and A is accessing physical page P1 starting at physical address X. Some of the contents of P1 may have been cached in different levels of the cache hierarchy. Now page P1 is swapped out and page P2 of process B is brought in at the same address X. If the cache lines belonging to page P1 are not invalidated, then process B might get cache hits on lines that originally belonged to process A's page P1 (because the physical tag would match).
Is this scenario valid? Does it make any difference whether the cache is VIPT or PIPT?
It would be great if you could cite how this is handled by a modern OS/processor.
If DMA (the preferred method of copying data between I/O devices and memory for large transfers) is cache coherent (i.e., the DMA agent checks the cache), then hardware will ensure that any dirty cache lines will be read when paging out and any old cache lines will be invalidated (or overwritten — some systems support I/O devices storing to cache). For non-coherent DMA, the OS would have to flush the page being paged out to memory.
Whether DMA is coherent is ISA and implementation dependent. For example, this blog post mentions that x86 guarantees coherence but Itanium and ARM do not. (Note that an implementation of an ISA can provide stronger guarantees.)
Whether the cache is virtually indexed or not would not impact the operations required because the OS would flush based on the virtual address and aliasing issues would already be handled in hardware or by software (e.g., by page coloring).
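As a rough illustration of the non-coherent case, the page-out path includes an explicit cache flush along the lines of the toy sketch below. The helper functions and PAGE_SIZE constant are placeholders for whatever cache-maintenance primitives and DMA setup the platform actually provides.

/* Toy sketch of paging out on a system whose DMA engine is NOT cache coherent. */
#include <stdio.h>

#define PAGE_SIZE 4096

static void flush_and_invalidate_dcache_range(void *addr, unsigned long len)
{
    /* On a real non-coherent system this would be an architecture-
     * specific cache-maintenance operation. */
    printf("flush+invalidate %lu bytes at %p\n", len, addr);
}

static void start_dma_write_to_disk(void *addr, unsigned long len)
{
    printf("DMA %lu bytes at %p out to disk\n", len, addr);
}

static void page_out_non_coherent(void *page_vaddr)
{
    /* The disk controller reads DRAM directly and cannot see dirty
     * lines sitting in the CPU caches, so write them back (and
     * invalidate them) before starting the transfer. */
    flush_and_invalidate_dcache_range(page_vaddr, PAGE_SIZE);
    start_dma_write_to_disk(page_vaddr, PAGE_SIZE);
}

/* With cache-coherent DMA (as x86 guarantees), the explicit flush is
 * unnecessary: the DMA agent snoops the caches itself. */

int main(void)
{
    static char page[PAGE_SIZE];
    page_out_non_coherent(page);
    return 0;
}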
I'm using perf as a basic event counter. I'm working on a program which suffers from data cache store misses, at a ratio as high as 80%.
I know how caches work in principle: data is loaded from memory on the various miss cases and removed from the cache when the cache pleases. What I don't understand is the difference between store misses and load misses. How does storing differ from loading? How can you store-miss?
A load miss (as you know) occurs when the processor needs to fetch data from main memory but that data does not exist in the cache. So whenever the processor wants some data from main memory, it queries the cache: if the data is already there you get a load hit, and otherwise you get a load miss.
A store miss is related to when the processor wants to write newly calculated data back to main memory. When it writes the data back, it has to make sure that the contents of the cache and main memory are kept in sync. This can happen under two different policies, which you can find here: Writing Policies.
So no matter which policy you choose, you first need to check whether the data is already in the cache so you can store to the cache first (since it's faster), and if the block you are looking for has been evicted from the cache, you get a store miss for that cache.
You can check the applet here, to get a better idea of what happens in different scenarios.
I'm not fully familiar with how perf defines these events, but given the common definition I believe load/store misses are just a way to break down the overall miss counts, so that you can tell which kinds of accesses miss more often. Note that loads are usually performed speculatively (at least in modern x86 CPUs), while stores are performed much later along the pipeline, after the commit point, so even a piece of code with both loads and stores to the same region can have different miss rates.
In MESI-based cache protocols a load would hit the cache, or miss and fetch the line from the memory or next cache levels, either exclusively if it's not owned by anyone else, or in a shared state if it is. It would write the data to the caches along the way in the process.
A store would fetch a line in the same manner, but use an RFO (read-for-ownership) request which grants it exclusive ownership and the right to modify the line. The line would still get cached, but once the new data is written to it locally (usually in your L1 cache), it would become modified. The hit/miss process would look the same though.
What Saman referred to in his answer is the breakdown between reads and writes. Loads and stores (and other forms of access like code reads) all form the "read" part, and writebacks (or intentional write-throughs using special commands or memory types like uncacheable) form the "write" part.
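If you want to see store misses in isolation, a small experiment like the following tends to work: stream writes over a buffer much larger than the last-level cache, so most stores miss and have to issue an RFO. The buffer size and the perf event names in the trailing comment are assumptions that vary by machine.

/* Streaming-store example: writes to a buffer far larger than the
 * cache, so most stores miss. */
#include <stdlib.h>

int main(void)
{
    size_t n = 256UL * 1024 * 1024;          /* 256 MiB, assumed >> LLC size */
    char *buf = malloc(n);
    if (!buf)
        return 1;

    for (int pass = 0; pass < 10; pass++)
        for (size_t i = 0; i < n; i += 64)   /* one store per 64-byte line */
            buf[i] = (char)pass;             /* mostly store misses (RFOs)  */

    free(buf);
    return 0;
}
/* Example run (event names vary by CPU and kernel):
 *   perf stat -e L1-dcache-stores,L1-dcache-store-misses ./a.out
 */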
mmap can be used to share read-only memory between processes, reducing the memory footprint:
process P1 mmaps a file, uses the mapped memory -> data gets loaded into RAM
process P2 mmaps a file, uses the mapped memory -> OS re-uses the same memory
But how about this:
process P1 mmaps a file, loads it into memory, then exits.
another process P2 mmaps the same file, accesses the memory that is still hot from P1's access.
Is the data loaded again from disk? Is the OS smart enough to re-use the virtual memory even if "mmap count" dropped to zero temporarily?
Does the behaviour differ between different OS? (I'm mostly interested in Linux/OS X)
EDIT: In case the OS is not smart enough -- would it help if there is one "background process", keeping the file mmaped, so it never leaves the address space of at least one process?
I am of course interested in performance when I mmap and munmap the same file successively and rapidly, possibly (but not necessarily) within the same process.
EDIT2: I see answers describing completely irrelevant points at great length. To reiterate the point -- can I rely on Linux/OS X to not re-load data that already resides in memory, from previous page hits within mmaped memory segments, even though the particular region is no longer mmaped by any process?
The presence or absence of the contents of a file in memory is much less coupled to mmap system calls than you think. When you mmap a file, it doesn't necessarily load it into memory. When you munmap it (or if the process exits), it doesn't necessarily discard the pages.
There are many different things that could trigger the contents of a file to be loaded into memory: mapping it, reading it normally, executing it, attempting to access memory that is mapped to the file. Similarly, there are different things that could cause the file's contents to be removed from memory, mostly related to the OS deciding it wants the memory for something more important.
In the two scenarios from your question, consider inserting a step between steps 1 and 2:
1.5. another process allocates and uses a large amount of memory -> the mmaped file is evicted from memory to make room.
In this case the file's contents will probably have to get reloaded into memory if they are mapped again and used again in step 2.
versus:
1.5. nothing happens -> the contents of the mmaped file hang around in memory.
In this case the file's contents don't need to be reloaded in step 2.
In terms of what happens to the contents of your file, your two scenarios aren't much different. It's something like this step 1.5 that would make a much more important difference.
As for a background process that constantly accesses the file in order to keep it in memory (for example, by scanning the file and then sleeping for a short time in a loop): this would of course force the file to remain in memory, but you're probably better off just letting the OS make its own decision about when to evict the file and when not to.
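If you would rather measure than guess, one option on Linux is to map the file and ask mincore(2) which of its pages are currently resident in the page cache. A rough sketch follows; the file name is a placeholder.

/* Count how many pages of a file are already resident in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("somefile.dat", O_RDONLY);     /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long psize = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + psize - 1) / psize;
    unsigned char *vec = malloc(npages);
    if (!vec) return 1;

    if (mincore(p, st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;                  /* low bit: page is in memory */

    printf("%zu of %zu pages resident\n", resident, npages);
    return 0;
}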
The second process will likely find the data from the first process in the buffer cache, so in most cases the data will not be loaded again from disk. But since the buffer cache is a cache, there are no guarantees that the pages don't get evicted in between.
You could start a third process and use mmap(2) and mlock(2) to fix the pages in ram. But this will probably cause more trouble than it is worth.
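For completeness, a minimal sketch of that pinning process might look like this. The file name is a placeholder, and note that RLIMIT_MEMLOCK usually limits how much you can lock, which is part of why it tends to be more trouble than it's worth.

/* Pinning helper process: map a file and lock its pages into RAM so
 * they are not evicted while this process runs. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared-data.bin", O_RDONLY);  /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (mlock(p, st.st_size) < 0)        /* pin the pages in RAM */
        perror("mlock");                 /* often fails past RLIMIT_MEMLOCK */

    pause();                             /* keep the mapping alive */
    return 0;
}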
Linux replaced the traditional UNIX buffer cache with a page cache, but the principle is still the same. The Mac OS X equivalent is called the Unified Buffer Cache (UBC).