Avoiding Translation Lookaside Buffer (TLB) pollution when using mmap() - linux-kernel

When we write a data item, the block containing it is first brought into the cache, and the item is then written into the cache. This can cause cache pollution. To avoid this, Intel introduced non-temporal instructions.
If I am going to use mmap() to write data to a file and never read it again, is it possible to avoid creating TLB entries for it? Is there any instruction similar to the non-temporal instructions available?

TLB entries are needed by the CPU to map from the virtual address to the physical address, so it is not possible to avoid them with mmap() or any similar API.
Even if it were possible to avoid storing the mapping in the TLB, every access to the mapped memory would need to reload the corresponding entries from the page tables, so the performance would be much worse.
Non-temporal accesses make sense only for stores, but the page table entries are read.

Related

Does mmap directly access the page cache, or a copy of the page cache?

To ask the question another way, can you confirm that when you mmap() a file that you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found performance would get terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead just did a single mmap that was passed into the thread (such that all threads just directly access the same memory region), then performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
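The claim that two mappings of one file see the same data can be checked with a small experiment. The following is only a sketch: the helper name `two_mappings_share_data` is made up for illustration, and it relies on Python's `mmap` creating shared mappings by default. It shows that a store through one mapping is visible through the other, which is consistent with both being backed by the same cached pages, although it cannot prove that the physical pages are identical.

```python
import mmap
import os
import tempfile


def two_mappings_share_data(size: int = mmap.PAGESIZE) -> bool:
    """Map one file twice; report whether a write via one mapping shows up in the other."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # the file needs a non-zero size before it can be mapped
        # Two independent mappings of the same file (shared mappings by default).
        first = mmap.mmap(fd, size)
        second = mmap.mmap(fd, size)
        try:
            first[0:5] = b"hello"            # store through the first mapping
            return second[0:5] == b"hello"   # load through the second mapping
        finally:
            first.close()
            second.close()
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(two_mappings_share_data())
```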
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
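To get a feel for the numbers, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions, not a measurement: it assumes "400GB" means 400 GiB, 4 KiB pages, 8-byte last-level entries (as on x86-64), that every page ends up being touched through every mapping, and that all 192 mappings live in one address space. The function names are invented for this example.

```python
GIB = 1 << 30
MIB = 1 << 20


def leaf_entries(mapping_bytes: int, page_bytes: int = 4096) -> int:
    """Last-level page table entries needed to map every page of the range."""
    return -(-mapping_bytes // page_bytes)  # ceiling division


def leaf_table_bytes(mapping_bytes: int, page_bytes: int = 4096, entry_bytes: int = 8) -> int:
    """Memory used by those entries (assumption: 8 bytes per entry)."""
    return leaf_entries(mapping_bytes, page_bytes) * entry_bytes


file_bytes = 400 * GIB                   # assumption: "400GB" read as 400 GiB
per_mapping = leaf_table_bytes(file_bytes)

print(leaf_entries(file_bytes))          # 104857600 entries per mapping
print(per_mapping / MIB)                 # 800.0 (MiB of entries per mapping)
print(192 * per_mapping / GIB)           # 150.0 (GiB if all 192 mappings are fully populated)
```

Under these assumptions a single shared mapping needs about 800 MiB of last-level entries, while 192 separate mappings of the same file could need up to 192 times that, even though they all point at the same cached pages.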
My guess is that the kernel doesn't actually create them all up front, which is why mmap() returns instantly even for a giant file. Instead, it probably establishes those entries as faults occur on the mmapped range, so the directory gets filled in over time. This extra work to populate the page directory is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range, which is why munmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.

Does the address translation of paging decrease memory access performance?

When paging is enabled, some hardware is responsible for translating virtual memory addresses into physical addresses. Known translations are usually kept in some sort of cache, the translation lookaside buffer (TLB).
Assuming a memory access where the address translation is cached, is it any slower than directly accessing memory without paging enabled?
I'm wondering about the overhead of that translation, even when it's cached, since the access to that cache will probably also take some (although very short) time. Or is that time planned as part of the clock cycle?
(To make it clear, my question is not about page faults or cache misses of the TLB)
Like everything in life, it depends! :-)
Let's assume, for the sake of simplicity, that (a) we're talking about data rather than instructions (b) all data memory accesses hit the level 1 cache (c) the level 1 data cache is a typical set associative cache.
Each block of the data cache must be identified with an address (less the offset). If the cache uses virtual addresses then no translation need take place and there is no overhead. If the cache uses physical addresses then the address must be translated prior to the data access adding additional latency to the request. Even for a small TLB, I don't think a high performance processor could both translate the address and then complete the cache request within the same cycle. So it's fair to assume that a physically addressed cache does indeed have overhead of address translation.
So virtually addressed caches sound like the better deal, right? Unfortunately it's a double-edged sword. The problem is that virtual memory often allows multiple virtual addresses to map to the same physical address. If in our cache there are two virtual addresses that map to a single physical address, modifying one will not be reflected in the other.
So, there is an option between these two extremes. Still assuming a set associative cache, we can use the virtual address as the index while simultaneously translating the address to a physical one. Afterwards, we use the physical address as the tag to access the data. This way we take the TLB translation off the critical path, thus achieving performance similar to that of the virtually addressed cache. It also allows us to avoid the virtual/physical aliasing problem, although it often needs a little extra help from the operating system.
So, you can see that it can be the same or slower, depending on how the cache is configured.
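Whether the virtually indexed, physically tagged scheme needs that extra help can be estimated with a little arithmetic: if the set index bits plus the line offset bits fit inside the page offset, the index is the same in the virtual and the physical address, so aliases always land in the same set. The sketch below assumes power-of-two sizes and 64-byte lines, and the function names and example cache geometries are illustrative only.

```python
def untranslated_bits_needed(cache_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Bits used for the line offset plus the set index (power-of-two sizes assumed)."""
    sets = cache_bytes // (ways * line_bytes)
    return (sets.bit_length() - 1) + (line_bytes.bit_length() - 1)


def index_fits_in_page_offset(cache_bytes: int, ways: int,
                              page_bytes: int = 4096, line_bytes: int = 64) -> bool:
    """True if the cache can be indexed using only bits that translation leaves unchanged."""
    page_offset_bits = page_bytes.bit_length() - 1
    return untranslated_bits_needed(cache_bytes, ways, line_bytes) <= page_offset_bits


# Hypothetical geometries, chosen only to show both outcomes.
print(index_fits_in_page_offset(32 * 1024, ways=8))   # True: 6 index + 6 offset bits = 12
print(index_fits_in_page_offset(64 * 1024, ways=4))   # False: 8 index + 6 offset bits = 14
```

In the second case some index bits come from the translated part of the address, which is where help from the operating system (or extra hardware) would be needed to keep aliases consistent.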

Paging and TLB operating systems

I'm really stuck on this question for my OS class. I don't want someone to just give me the answer, though; instead, I'd like someone to tell me how to work it out.
Example Question:
This system uses simple paging and TLB
Each memory access requires 80ns
TLB access requires 10ns
TLB hit rate is 80%.
Work out the actual speedup because of the TLB?
NOTE: I changed the memory access time and the TLB access time in the question because, as I said, I don't want the answer, just a way to work it out.
In case the virtual address translation is cached in the TLB, all we need is one lookup in the TLB that will give us a physical address, and we are done. The interesting part is if we need to do the page table walk. Think carefully about what the system has to do in case it did not find an address in the TLB (well it already had to do a TLB look-up). Memory access takes 80ns, but how many of them do you need to actually get the physical address? Pretty much every paging architecture follows the schema that page-tables are stored in memory and only the entry point, the address that points to the base of the first page table (the root) is in a register.
Once you know how long a translation takes without the TLB, you can calculate the speed-up by comparing it with the time taken when the TLB is used.
On a TLB hit (80% of accesses), the TLB lookup takes 2ns and accessing the page in main memory takes 20ns, so the first part is
0.8×(2+20)
On a TLB miss, i.e. (1-0.8) = 20% of accesses, the TLB is still checked first, which takes 2ns. The page table is then consulted, and since it is stored in main memory, that takes 20ns. Once the desired frame has been found in the page table, another memory access of 20ns is needed to fetch the data from main memory, so the miss part is
0.2×(2+20+20)
Combining the two:
Effective access time = 0.8×(2+20) + 0.2×(2+20+20)
= 26ns
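The same calculation can be written as a small function, which makes it easy to plug in other numbers. This is a sketch that assumes a single-level page table (one memory access per walk) unless `walk_accesses` says otherwise; the function and parameter names are invented for this example, and the values used are the 2ns and 20ns from the calculation above.

```python
def effective_access_time(hit_rate: float, tlb_ns: float, mem_ns: float,
                          walk_accesses: int = 1) -> float:
    """Average time per access with a TLB in front of the page table."""
    hit_cost = tlb_ns + mem_ns                            # TLB lookup + data access
    miss_cost = tlb_ns + walk_accesses * mem_ns + mem_ns  # TLB lookup + table walk + data access
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost


def speedup_from_tlb(hit_rate: float, tlb_ns: float, mem_ns: float,
                     walk_accesses: int = 1) -> float:
    """Ratio of the access time without a TLB to the effective access time with one."""
    without_tlb = walk_accesses * mem_ns + mem_ns         # every access walks the table
    return without_tlb / effective_access_time(hit_rate, tlb_ns, mem_ns, walk_accesses)


print(round(effective_access_time(0.8, 2, 20), 3))  # about 26 ns
print(round(speedup_from_tlb(0.8, 2, 20), 3))       # about 1.538
```

Without a TLB every access costs two memory accesses (40ns with these numbers), so the speed-up is 40/26, roughly 1.54.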

Does memory address translation need extra access to memory?

I've got a question about virtual memory management, more specifically, the address translation.
When an application runs, the CPU receives instructions containing virtual memory addresses, and translates them into physical addresses via the page table.
My question is, since the page table also resides in memory, does that mean the CPU has to access memory twice for a single memory-access instruction? If the answer is no, then how does this actually work? Which part did I miss?
Could anyone give me some details about this?
As usual the answer is neither yes nor no.
In the worst case you have to walk the page table, which is indeed stored in (some kind of) memory. This is not necessarily only one lookup; it can be multiple lookups, as for example with a two-level table (example from Wikipedia).
However, the page table is typically accompanied by a hardware assist called the translation lookaside buffer, which is essentially a cache for the page table; the lookup process can be seen in this image. It works just as you would expect a cache to work: if a lookup succeeds, you happily continue with the physical fetch; if it fails, you proceed to the aforementioned page walk and update the cache afterwards.
This hardware assist is usually implemented as a CAM (Content Addressable Memory), something that is mostly used in network processing but is also very useful here. It is a memory component that does not do the lookup based upon an address but based upon 'content', or any generic key (the keys don't have to be contiguous, incrementing numbers). In this case the key would be your virtual address, and the resulting memory lookup would be your physical address. As this CAM is a separate component and is very fast, you could state that as long as you hit it, you don't incur any extra memory overhead for virtual -> physical address translation.
You could ask why they don't put the whole page table in a CAM. Quite simply, CAMs are both quite expensive and, more importantly, quite energy-hungry, so you don't want to make them too big (we wouldn't want a laptop that requires 1 kW to run, would we?).
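The lookup flow described in this answer can be modelled in a few lines. The toy model below is only a sketch: it assumes a 32-bit style layout with 4 KiB pages and a two-level table indexed by 10 bits per level, uses a plain dictionary as an unbounded TLB with no eviction, and counts simulated memory accesses. The class name `ToyMmu` and its methods are invented for this example.

```python
PAGE_BITS = 12    # assumption: 4 KiB pages
INDEX_BITS = 10   # assumption: 10 bits of index per table level


class ToyMmu:
    def __init__(self, mappings):
        """mappings: {virtual page number: physical frame number}."""
        self.directory = {}              # top-level index -> {second-level index -> frame}
        for vpn, frame in mappings.items():
            top, low = divmod(vpn, 1 << INDEX_BITS)
            self.directory.setdefault(top, {})[low] = frame
        self.tlb = {}                    # vpn -> frame; unbounded, no eviction (simplification)
        self.memory_accesses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, 1 << PAGE_BITS)
        frame = self.tlb.get(vpn)
        if frame is None:                # TLB miss: walk both levels in "memory"
            top, low = divmod(vpn, 1 << INDEX_BITS)
            self.memory_accesses += 1    # read the top-level entry
            table = self.directory.get(top)
            if table is None:
                raise LookupError("page fault")
            self.memory_accesses += 1    # read the second-level entry
            if low not in table:
                raise LookupError("page fault")
            frame = table[low]
            self.tlb[vpn] = frame        # update the cache after the walk
        return (frame << PAGE_BITS) | offset

    def load(self, vaddr: int) -> int:
        paddr = self.translate(vaddr)
        self.memory_accesses += 1        # the data access itself
        return paddr


mmu = ToyMmu({0x12345: 0x42})
print(hex(mmu.load(0x12345678)), mmu.memory_accesses)  # 0x42678 3 (miss: two walk reads + data)
print(hex(mmu.load(0x12345ABC)), mmu.memory_accesses)  # 0x42abc 4 (hit: data access only)
```

The first access to a page costs three simulated memory accesses, while later accesses to the same page cost one, which is the "neither yes nor no" of the answer in miniature.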
Sometimes.
The MMU contains a cache of virtual to physical address mapping, called a TLB (Translation Lookaside Buffer).
If the page in question is not in the TLB (a TLB miss), then it needs to load the relevant piece of page table from main memory into that cache first, which will need additional memory access.
Finally, if the page cannot be found at all, a trap is issued to the CPU (a page fault), and the CPU has an opportunity to fix this, e.g. by allocating memory or loading the page from a file, swap space or similar.
The details of how this is done vary between architectures. On some, a TLB miss also requires software running on the CPU to refill the TLB, though on most this is done automatically in hardware. (The CPU would, however, have to flush the TLB when doing a context switch, and load a new page table for e.g. a new process.)
More info e.g. here https://www.kernel.org/doc/gorman/html/understand/understand006.html

What is TLB shootdown?

What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The action of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
Page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware).
However, the page table can be quite large, and traversing it (to find the physical address corresponding to a virtual address) can be a time-consuming process. To make this process faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently used virtual-to-physical address translations.
As can be clearly seen, the TLB entries need to be in sync with their respective page table entries at all times. Now, the TLBs are a per-core cache, i.e. every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all of the cores. This process is called TLB shootdown.
TLB flushing can be triggered by various virtual memory operations that change the page table entries like page migration, freeing pages etc.
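The idea can be illustrated with a toy simulation in which each core keeps its own dictionary as a TLB and a page table change invalidates the matching entry everywhere. This is a sketch only: real systems signal the other cores (for example with inter-processor interrupts) rather than looping over them, and the class and method names here are invented for illustration.

```python
class ToySmpSystem:
    def __init__(self, n_cores: int, page_table: dict):
        self.page_table = dict(page_table)          # shared: virtual page -> frame
        self.tlbs = [{} for _ in range(n_cores)]    # one private TLB per core

    def translate(self, core: int, vpn: int) -> int:
        tlb = self.tlbs[core]
        if vpn not in tlb:                          # miss: fill from the shared page table
            tlb[vpn] = self.page_table[vpn]
        return tlb[vpn]

    def remap(self, initiator: int, vpn: int, new_frame: int) -> int:
        """Change a mapping and invalidate it in every TLB.

        Returns how many *other* cores held a stale entry that was shot down.
        """
        self.page_table[vpn] = new_frame
        shot_down = 0
        for core, tlb in enumerate(self.tlbs):
            had_entry = tlb.pop(vpn, None) is not None
            if had_entry and core != initiator:
                shot_down += 1
        return shot_down


system = ToySmpSystem(n_cores=4, page_table={7: 100})
for core in (0, 1, 2):
    system.translate(core, 7)        # cores 0, 1 and 2 cache the translation

print(system.remap(0, 7, 200))       # 2: cores 1 and 2 had stale entries
print(system.translate(1, 7))        # 200: core 1 refills from the updated page table
```

Without the invalidation step, core 1 would keep translating page 7 to frame 100 after the remap, which is exactly the inconsistency a shootdown prevents.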

Resources