How are page replacement policies like LRU/LFU implemented ? Does the hardware MMU track the reference count(in case of LFU)?
Is it possible for it to be implemented as part of the kernel?
Generally, the hardware provides minimal support for tracking which pages are accessed, and the OS kernel then uses that to implement some kind of pseudo-LRU paging policy.
Fox example, on x86, the MMU will set the 'A' bit in the PTE (page table entry) whenever a page is accessed. So the kernel continuously loops though all the memory in use, checking, and clearing this bit. Any page that has the bit set has been accessed since the last sweep, and any page where the bit is (still) clear since the last sweep has not. These pages are candidates for replacement. The details vary from OS to OS, but generally there's some sort of queue structure(s) where these pages are tracked, and the oldest ones replaced.
Related
I am thinking about 'Minimizing page faults (and TLB faults) while “walking” a large graph'
'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question looking at the problem from the other side, but does not have a solution.
I wish to be able to load some data from memory into a register, but have the load abort rather than getting a page fault, if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any none standard permission.
(Ideally, I would also like to abort on a TLB fault.)
The RTM (Restricted Transactional Memory) part of the TXT-NI feature allows to suppress exceptions:
Any fault or trap in a transactional region that must be exposed to software will be suppressed. Transactional
execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never
occurred.
[...]
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XM, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally, and
require a non-transactional execution. These events are suppressed as if they had never occurred.
I've never used RTM but it should work something like this:
xbegin fallback
; Don't fault here
xend
; Somewhere else
fallback:
; Retry non-transactionally
Note that a transaction can be aborted for many reasons, see chapter 16.8.3.2 of the Intel manual volume 1.
Also note that RTM is not ubiquitous.
Besides RTM I cannot think of another way to suppress a load since it must return a value or eventually signal an abort condition (which would be the same as a #PF).
There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.
(For querying virtual memory for pages being paged out or not, there is the Linux system call mincore(2) that produces a bitmap of present/absent for a range of pages starting (given as void* start / size_t length. That's maybe similar to the HW page tables so probably could let you avoid page faults until after you've touched memory, but unrelated to TLB or cache. And maybe doesn't rule out soft page faults, only hard. And of course that's only the current situation: pages could be evicted between query and access.)
Would a CPU feature like this be useful? probably yes for a few cases
Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.
When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.
Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)
So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if you the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.
If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM. (e.g. with Windows
PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph)
This will require a system call, but system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.
CPU design possibilities
Like I said, I don't think any current ISAs have this, but it would I think be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.
There are a couple possibilities that come to mind:
a queryTLB m8 instruction that writes flags (e.g. CF=1 for present) according to whether the memory operand is currently hot in TLB (including 2nd-level TLB), never doing a page walk. And a querypage m8 that will do a page walk on a TLB miss, and sets flags according to whether there's a page table entry. Putting the result in a r32 integer reg you could test/jcc on would also be an option.
a try_load r32, r/m32 instruction that does a normal load if possible, but sets flags instead of taking a page fault if a page walk finds no valid entry for the virtual address. (e.g. CF=1 for valid, CF=0 for abort with integer result = 0, like rdrand. It could make itself useful and set other flags (SF/ZF/PF) according to the value, if there is one.)
The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the IsBadXxxPtr Windows system call, except that that probably checks the logical memory map, not the hardware page tables.)
A try_load insn that also sets/clear flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempt page-walk).
These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.
Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blod (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)
Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.
Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.
The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.
I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.
When a page is swapped out to disk, some of its content might be present in a cache (I believe this would be a very rare scenario, because, if the page is not accessed for a long time, most likely the cache-lines containing its content would also have been evicted by then.) What happens to these cache-lines when the page is swapped out. Do they need to be invalidated immediately ? Does it make any difference if the line is dirty or clean ? Who controls this process, OS or hardware or both ?
To explain why this need be taken care of, lets assume there are processes A & B and A is accessing physical page P1 at starting physical address X. Some of the contents of P1 must have been cached in different levels of caches. Now the page P1 is swapped out and page P2 of process B is brought in at the same address X. So if the cache-lines belonging to page P1 are not invalidated, then process B might get caches hit on those lines which originally belonged to process A for page P1 (because the physical tag would match).
Is this scenario valid ? Does it make any difference if the cache is VIPT or PIPT ?
It would be great, if you can cite how it is handled in modern OS/Processor.
If DMA (the preferred method of copying data between I/O devices and memory for large transfers) is cache coherent (i.e., the DMA agent checks the cache), then hardware will ensure that any dirty cache lines will be read when paging out and any old cache lines will be invalidated (or overwritten — some systems support I/O devices storing to cache). For non-coherent DMA, the OS would have to flush the page being paged out to memory.
Whether DMA is coherent is ISA and implementation dependent. For example, this blog post mentions that x86 guarantees coherence but Itanium and ARM do not. (Note that an implementation of an ISA can provide stronger guarantees.)
Whether the cache is virtually indexed or not would not impact the operations required because the OS would flush based on the virtual address and aliasing issues would already be handled in hardware or by software (e.g., by page coloring).
I've got a question about virtual memory management, more specifically, the address translation.
When an application runs, the CPU receives instructions containing virtual memory addresses, and translates them into physical addresses via the page table.
My question is, since the page table also aside at a memory block, does that means the CPU has to access the memory twice in a single memory-access instruction? If the answer is no, then how does this actually work? Which part did I miss?
Could anyone give me some details about this?
As usual the answer is neither yes or no.
Worst case you have to do a walk of the page table, which is indeed stored in (some kind of) memory, this is not necessarily only one lookup, it can be multiple lookups, see for example a two-level table (example from wikipedia).
However, typically this page table is accompanied by a hardware assist called the translation lookaside buffer, this is essentially a cache for the page table, the lookup process can be seen in this image. It works just as you would expect a cache too work, if a lookup succeeds you happily continue with the physical fetch, if it fails you proceed to the aforementioned page walk and you update the cache afterwards.
This hardware assist is usually implemented as a CAM (Content Addressable Memory), something that's most used in network processing but is also very useful here. It is a memory-component that does not do the lookup based upon an address but based upon 'content', or any generic key (the keys dont' have to be contiguous, incrementing numbers). In this case the key would be your virtual address, and the resulting memory lookup would be your physical address. As this CAM is a separate component and as it is very fast you could state that as long as you hit it you don't incur any extra memory overhead for virtual -> physical address translation.
You could ask why they don't put the whole page table in a CAM? Quite simply, CAM's are both quite expensive and more importantly quite energy-hungry, so you don't want to make them too big (we wouldn't want a laptop that requires 1KW to run do we?).
Sometimes.
The MMU contains a cache of virtual to physical address mapping, called a TLB (Translation Lookaside Buffer).
If the page in question is not in the TLB (a TLB miss), then it needs to load the relevant piece of page table from main memory into that cache first, which will need additional memory access.
Finally, if the page cannot be found at all, a trap is issued to the CPU (a page fault), and the CPU have an opportunity to fix this - e.g. allocate memory, load the piece from a file, swap space and similar.
The details on how this is done varies between architectures, on some, the TLB miss also involves the CPU to configure the TLB, though on most this is automatic. (but the CPU would have to flush the TLB when doing a context switch, and load a new pagetable for e.g. a new process)
More info e.g. here https://www.kernel.org/doc/gorman/html/understand/understand006.html
I am trying to understand how the linux kernel handles TLB misses. Specifically, I know that the page table walk happens in follow_page in mm/memory.c but how is follow_page called when a TLB miss occurs. How is the return value (struct page) of follow_page passed back to the hardware? Can someone illustrate a call graph for the TLB miss handling from when a TLB Miss exception is raised by the hardware to when follow_page is called?
I searched for follow_page inside the kernel code http://lxr.linux.no/linux+v3.4.4/+search=follow_page but the results do not seem to help much.
To make things clear, lets say the hardware is x86_64.
I found out that for most x86 architectures, the hardware does the page walk when a TLB miss occurs. The software page walk code follow_page in mm/memory.c is not invoked during a TLB miss. So, as per my understanding, there is no call graph for handling TLB misses in linux kernel.
As you noticed yourself, the MMU of intel processors since 80386 has hardware resolution for filling the translation lookaside buffers. That explains why the page table for this architecture have such a rigid structure. There are various places where walking down the page tables (as in follow_page) is required on this architecture, however, esp. in handle_mm_fault, although it is generally possible to expect a more specific case and drop most of the tests.
Title says it pretty much all : is there a way to get the lowest free virtual memory address under windows ? I should add that I am interested by this information at the beginning of the program (before any dynamic memory allocation has been done).
Why I need it : trying to build a malloc implementation under Windows. If it is not possible I would have to really to whatever VirtualAlloc() returns when given NULL as first parameter. While you would expect it to do something sensible, like allocation memory at the bottom of what is available, there are no guarantees.
This can be implemented yourself by using VirtualQuery looking for pages that are marked as free. It would be relatively slow though. (You will also need to consider allocation granularity which is different from page size.)
I will say that unless you need contiguous blocks of memory, trying to keep everything close together is mostly meaningless since if two pages of virtual memory might be next to each other in the address space, there is no reason to assume they are close to each other in physical memory. In fact, even if they are close to each other at some point in time, if those pages get moved to backing store and then faulted back into memory, the page would not be faulted to the same physical address page.
The OS uses more complicated metrics than just what is the "lowest" memory address available. Specifically, VirtualAlloc allocates pages of memory, so depending on how much you're asking for, at least one page of unused address space has to be available at the starting address. So even if you think there's a "lower" address that it should have used, that address might not have been compatible with the operation that you asked for.