How exactly do minor page faults get identified/resolved? - linux-kernel

I feel very clear on what happens with a segmentation fault and a major page fault, but I'm a bit more curious on the subtleties of minor page faults, and maybe an example would be with dynamically linked libraries. Wikipedia says, for instance:
If the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded in memory, then it is called a minor or soft page fault. The page fault handler in the operating system merely needs to make the entry for that page in the memory management unit point to the page in memory and indicate that the page is loaded in memory; it does not need to read the page into memory. This could happen if the memory is shared by different programs and the page is already brought into memory for other programs.
The line, "The page fault handler in the operating system merely needs to make the entry for that page in the memory management unit point to the page in memory" confuses me. Each process has its own page table. So if I try to map in, say, libc, what's the process that the kernel goes through to figure out that it's already been mapped? How does it know that another process is using it or that there's already a frame associated with it? Does this happen with the page cache? I was reading a bit about it here, but I think some clarification would be nice on the steps that occur in the kernel to identify and resolve a minor page fault would be helpful.
Edit: It looks like a radix tree is used to keep track? Although I'm not quite sure I'm understanding this correctly.

At first, the kernel has no idea if the page is in memory or not. Presumably, the process does have a handle open to the file however, so the kernel performs an operation through the kernel-side file descriptor entry. This involves calling into the filesystem which, of course, knows what pages of the file are resident in memory since it's the code that would load the page were one needed.

Related

About TLB entries and Page table entries

From a website about TLB: (https://www.bottomupcs.com/virtual_memory_hardware.xhtml#other-page-related-faults):
For the following parts I highlighted:
1: Is the format of TLB entries the same as PTE(page table entries)? And it's not clear
the "the page" in the page can be marked as mean TLB entry or PTE?
2: For "the pages" in go through all the pages, are they TLB entries or PTEs?
3: Why it's moved out of not "moved into"?
4: Is that the order is (1)set the two bits (2)put into TLB, or the converse?
There are two other important faults that the TLB can generally generate which help to mange accessed and dirty pages. Each page generally contains an attribute in the form of a single bit which flags if the page has been accessed or is dirty.
An accessed page is simply any page that has been accessed. 1When a page translation is initially loaded into the TLB the page can be marked as having been accessed (else why were you loading it in?[19])
2The operating system can periodically go through all the pages and clear the accessed bit to get an idea of what pages are currently in use. When system memory becomes full and it comes time for the operating system to choose pages to be swapped out to disk, obviously those pages whose accessed bit has not been reset are the best candidates for removal, because they have not been used the longest.
A dirty page is one that has data written to it, and so does not match any data already on disk. For example, if a page is loaded in from swap and then written to by a process, 3before it can be moved out of swap it needs to have its on disk copy updated. A page that is clean has had no changes, so we do not need the overhead of copying the page back to disk.
Both are similar in that they help the operating system to manage pages. The general concept is that a page has two extra bits; the dirty bit and the accessed bit. 4When the page is put into the TLB, these bits are set to indicate that the CPU should raise a fault.
When a process tries to reference memory, the hardware does the usual translation process. However, it also does an extra check to see if the accessed flag is not set. If so, it raises a fault to the operating system, which should set the bit and allow the process to continue. Similarly if the hardware detects that it is writing to a page that does not have the dirty bit set, it will raise a fault for the operating system to mark the page as dirty.
I suggest ignoring that link as a source. It is highly confusing. I do not see any mention of a specific implementation but it is clearly describing one.
In any rationally designed processor, the TLB is completely transparent to programmers (even systems programmers). It is entirely a piece of hardware.
1: Is the format of TLB entries the same as PTE(page table entries)?
A programmer never sees TLB entries. They could have the same format as PTEs. They might not.
2: For "the pages" in go through all the pages, are they TLB entries or PTEs?
Programmers have no access to TLB entries. They have to refer to PTEs.
3: Why it's moved out of not "moved into"?
Probably but there seems to be a lot of confusion in your link.
4: Is that the order is (1)set the two bits (2)put into TLB, or the converse?
This is describing some specific, unnamed implementation. Most processors have a dirty bit but not all have an accessed bit.

UNIX system call to unset the reference bit of a specific page in page table?

I'm trying to count hits of a specific set of pages, by hacking the reference bits in the page table. Is there any system call or any other way to unset reference bits (in UNIX-like systems)?
A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses. (https://en.wikipedia.org/wiki/Page_table)
In unix-like systems there is a bit associated with each page table entry, called "reference" bit, which indicates if a page was accessed since the bit was unset.
The linux kernel unsets these reference bits periodically and checks a while after that to know what pages have been accessed, in order to detect "hot" pages. But this information is very coarse grain and low-precision since it doesn't say anything about the number of accesses and their time.
I want to count accesses to specific pages during shorter epochs by unsetting reference bits then check if the pages have been accessed after a short time.
Therefore, I was wondering if there is any system call or CPU interrupt which provides means to unset "reference bits". Otherwise, I need to dive deep into kernel to see what goes on down there.
There is no API for resetting the page reference bits. Page management is a very twitchy aspect of kernel tuning and no one wants to upset it. Of course you could modify the kernel to your needs.
Instead, you might look into Valgrind which is a debugging and profiling tool for running a single program. Ordinarily it detects subtle memory errors such as detecting use of a dynamic memory block after it has been freed.
If you need page management information for the system as a whole, I think the most expedient solution is hacking the kernel.

How can I force a page to fault, even if it is already in the tlb?

I'm trying to write a toy working set estimator, by keeping track of page faults over a period of time. Whenever a page is faulted in, I want to record that it was touched. The scheme breaks down when I try to keep track of accesses to already-present pages. If a page is read from or written to without triggering a fault, I have no way of tracking the access.
So then, I want to be able to cause a "lightweight" fault to occur on a page access. I've heard of some method at some point, but I didn't understand why it worked so it didn't stick in my mind. Dirty bit maybe?
You can use mprotect with PROT_NONE ("Page cannot be accessed"). Then any access to the given page will cause a fault.
The usual way to do this is to simply clear the "present" bit for the page, while leaving the page in memory and the necessary kernel data structures in place so that the kernel knows this.
However, depending on the architecture in question you may have better options - for example, on x86 there is an "Accessed" flag (bit 5 in the PTE) that is set whenever the PTE is used in a linear address translation. You can simply clear this bit whenever you like, and the hardware will set it to record that the page was touched.
Using either of these methods you will need to clear the cached translation for that page out of the TLB - on x86 you can use the INVLPG instruction.

Programatically read program's page fault count on Windows

I'd like to my Windows C++ program to be able to read the number of hard page faults it has caused. The program isn't running as administrator. Edited to add: To be clear, I'm not as interested in the aggregate page fault count of the whole system.
It looks like ETW might export counters for this, but I'm having a lot of difficulty figuring out the API, and it's not clear what's accessible by regular users as compared to administrators.
Does anyone have an example of this functionality lying around? Or is it simply not possible on Windows?
(OT, but isn't it sad how much easier this is on *nix? gerusage() and you're done.)
afai can tell the only way to do this would be to use ETW (Event Tracing for Windows) to monitor kernel Hard Page Faults. The event payload has a thread ID that you might be able to correlate with an existing process (this is going to be non-trivial btw) to produce a running per-process count. I don't see any way to get historical information per process.
I can guarantee you that this is A Hard Problem because Process Explorer supports only Page Faults (soft or hard) in its per-process display.
http://msdn.microsoft.com/en-us/magazine/ee412263.aspx
A page fault occurs when a sought-out
page table entry is invalid. If the
requested page needs to be brought in
from disk, it is called a hard page
fault (a very expensive operation),
and all other types are considered
soft page faults (a less expensive
operation). A Page Fault event payload
contains the virtual memory address
for which a page fault happened and
the instruction pointer that caused
it. A hard page fault requires disk
access to occur, which could be the
first access to contents in a file or
accesses to memory blocks that were
paged out. Enabling Page Fault events
causes a hard page fault to be logged
as a page fault with a type Hard Page
Fault. However, a hard fault typically
has a considerably larger impact on
performance, so a separate event is
available just for a hard fault that
can be enabled independently. A Hard
Fault event payload has more data,
such as file key, offset and thread
ID, compared with a Page Fault event.
I think you can use GetProcessMemoryInfo() - Please refer to http://msdn.microsoft.com/en-us/library/ms683219(v=vs.85).aspx for more information.
Yes, quite sad. Or you could just not assume Windows is so gimp that it doesn't even provide a page fault counter and look it up: Win32_PerfFormattedData_PerfOS_Memory.
There is a C/C++ sample on Microsoft's site that explain how to read performance counters: INFO: PDH Sample Code to Enumerate Performance Counters and Instances
You can copy/paste it and I think you're interested by the "Memory" / "Page Reads/sec" counters, as stated in this interesting article: The Basics of Page Faults
This is done with performance counters in windows. It's been a while since I've done anything with them. I don't recall whether or not you need to run as administrator to query them.
[Edit]
I don't have example code to provide but according to this page, you can get this information for a particular process:
Process : Page Faults/sec. This is an
indication of the number of page
faults that occurred due to requests
from this particular process.
Excessive page faults from a
particular process are an indication
usually of bad coding practices.
Either the functions and DLLs are not
organized correctly, or the data set
that the application is using is being
called in a less than efficient
manner.
I don't think you need administrative credential to enumerate the performance counters. A sample at codeproject Performance Counters Enumerator

What is Hard Faults in XPerf

I'm trying to profile a system with XPerf.
And see that performance problems occurs when there is activity in HardFaults !
But what I cant figure out and find in google what are these Hard Faults that xperf shows.
What are they related to?
What do they indicate?
Is there any universal remedy for such situations?
Hard faults table
Indeed.
"First of all, a "hard fault" was previously called a "page fault" in earlier versions of Windows. Perhaps page faults were more easily understood from the name, too. A hard fault happens when the address in memory of part of a program is no longer in main memory, but has been instead swapped out to the paging file, making the system go looking for it on the hard disk. When this happens a lot, it causes slowdowns and increased hard disk activity. When it happens an awful lot, the possibility of hard disk thrashing arises. That's when a program stops responding, but the hard drive continues to run for an extended period. This has historically been referred to as "getting into the page file."
Here is the article.
http://www.brighthub.com/computing/windows-platform/articles/52249.aspx
But be carefull with following suggestions of this article, because it is not quite correct to do so:
http://player.microsoftpdc.com/Session/1689962d-dea2-48bd-80d8-96e954fa5329
http://player.microsoftpdc.com/Session/1c97b279-d7e3-4a3e-9a76-0dac23dfddb5
A hard fault is when a the request process private page or file backed page is not in RAM. Hard faults occur for allocations that have been paged out, but also accesses to data file and executable images.
The type of page will determine where the data data will be read from. Most hard faults are not for data from teh page file, but for data files (your word doc, for example).
Vaguely I remember a hard fault is when the requested virtual memory block is not in memory anymore and needs to be paged-in from the swapfile.

Resources