FreeBSD shpgperproc: what is it responsible for? - performance

I've googled a lot about what "page share factor per proc" is responsible for and found nothing. It's just interesting to me; I have no current problem with it, just curious (want to know more). In sysctl it is:
vm.pmap.shpgperproc
Thanks in advance

The first thing to note is that shpgperproc is a loader tunable, so it can only be set at boot time with an appropriate directive in loader.conf, and it's read-only after that.
The second thing to note is that it's defined in <arch>/<arch>/pmap.c, which handles the architecture-dependent portions of the vm subsystem. In particular, it's actually not present in the amd64 pmap.c - it was removed fairly recently, and I'll discuss this a bit below. However, it's present for the other architectures (i386, arm, ...), and it's used identically on each architecture; namely, it appears as follows:
void
pmap_init(void)
{
...
TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
and it's not used anywhere else. pmap_init() is called only once, at boot time, as part of the vm subsystem initialization. maxproc is just the maximum number of processes that can exist (i.e. kern.maxproc), and cnt.v_page_count is just the number of physical pages of memory available (i.e. vm.stats.v_page_count).
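As a purely illustrative calculation (the numbers are hypothetical, not taken from the question): on a machine with 4 GB of RAM (cnt.v_page_count ≈ 1,048,576 4-KB pages) and kern.maxproc = 6164, the default shpgperproc of 200 gives pv_entry_max = 200 * 6164 + 1,048,576 = 2,281,376 pv_entry slots.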
A pv_entry is basically just a virtual mapping of a physical page (or more precisely a struct vm_page), so if two processes share a page and both have it mapped, there will be a separate pv_entry structure for each mapping. Thus, given a page (struct vm_page) that needs to be dirtied or paged out or something else requiring a hw page table update, the list of corresponding mapped virtual pages can easily be found by looking at the corresponding list of pv_entrys (as an example, take a look at i386/i386/pmap.c:pmap_remove_all()).
The use of pv_entrys makes certain VM operations more efficient, but the current implementation (for i386 at least) seems to allocate a static amount of space (see pv_maxchunks, which is set based on pv_entry_max) for pv_chunks, which are used to manage pv_entrys. If the kernel can't allocate a pv_entry after deallocating inactive ones, it panics.
Thus we want to set pv_entry_max based on how many pv_entrys we want space for; clearly we'll want at least as many as there are pages of RAM (which is where cnt.v_page_count comes from). Then we'll want to allow for the fact that many pages will be multiply-virtually-mapped by different processes, since a pv_entry will need to be allocated for each such mapping. Thus shpgperproc - which has a default value of 200 on all arches - is just a way to scale this. On a system where many pages will be shared among processes (say on a heavily-loaded web server running apache), it's apparently possible to run out of pv_entrys, so one will want to bump it up.
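To raise it, one would put a line like the following in /boot/loader.conf and reboot (400 here is purely an illustrative value, not a recommendation):

vm.pmap.shpgperproc=400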

I don't have a FreeBSD machine at hand right now, but it seems this parameter is defined and used in pmap.c: http://fxr.watson.org/fxr/ident?v=FREEBSD7&im=bigexcerpts&i=shpgperproc

Related

UNIX system call to unset the reference bit of a specific page in page table?

I'm trying to count hits of a specific set of pages, by hacking the reference bits in the page table. Is there any system call or any other way to unset reference bits (in UNIX-like systems)?
A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses. (https://en.wikipedia.org/wiki/Page_table)
In UNIX-like systems there is a bit associated with each page table entry, called the "reference" bit, which indicates whether the page has been accessed since the bit was last unset.
The Linux kernel unsets these reference bits periodically and checks a while later to see which pages have been accessed, in order to detect "hot" pages. But this information is coarse-grained and low-precision, since it says nothing about the number of accesses or their timing.
I want to count accesses to specific pages during shorter epochs by unsetting the reference bits and then checking whether the pages have been accessed after a short time.
Therefore, I was wondering if there is any system call or CPU interrupt which provides means to unset "reference bits". Otherwise, I need to dive deep into kernel to see what goes on down there.
There is no API for resetting the page reference bits. Page management is a very twitchy aspect of kernel tuning and no one wants to upset it. Of course you could modify the kernel to your needs.
Instead, you might look into Valgrind, which is a debugging and profiling tool for running a single program. Ordinarily it detects subtle memory errors, such as use of a dynamic memory block after it has been freed.
If you need page management information for the system as a whole, I think the most expedient solution is hacking the kernel.

How absolute code gets generated

I have gone through the memory management concepts in Galvin's Operating System Concepts, and I read this statement:
If you know at compile time where the process will reside in memory, then absolute code can be generated.
How, at compile time, can it be known at which memory location in main memory the process is going to be stored?
Can someone explain what exactly it means to know at compile time where the process will reside in memory, given that memory is allocated only when the program moves from the ready to the running state?
Generally, machine code isn't position-independent. In order to be able to load it at an arbitrary starting address and run it there, one needs some extra information about the machine code (e.g. where it contains addresses to the various parts of itself), so it can be adjusted to the arbitrary position.
OTOH, if the code is always going to be loaded at the same fixed address, you don't need any of that extra information and processing.
By absolute he means fixed + final, already adjusted to the appropriate address.
The processor does not "know" anything. You "tell" it.
I don't know exactly what he means by "absolute code"; depending on which operating system you use, the program with its code and data will be loaded to a virtual address and executed from there.
Besides this, it is not the compiler but the linker that sets the address the program will be loaded to.
Modern operating systems like Linux are using Address Space Layout Randomization to avoid having one static address where every program is loaded and to avoid the possibility of exploiting software flaws.
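To see the difference in practice, here is a tiny experiment, assuming a Linux-like system with a reasonably recent gcc (flag defaults vary by distribution):

#include <stdio.h>

int main(void)
{
    /* Print where this function ended up in the address space. */
    printf("main is at %p\n", (void *)main);
    return 0;
}

Built as a position-independent executable (gcc -pie -fPIE), the printed address typically changes from run to run under ASLR; built with gcc -no-pie, it stays the same, because the code was linked for one fixed ("absolute") load address.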
If you're writing your own operating system, the osdev.org wiki could be a good resource for you. If you can read/speak German, I also recommend lowlevel.eu.

Why memory-mapped files are always mapped at page boundaries?

This is my first question here; I'm not sure if it is off-topic.
While self-studying, I have found the following statement regarding Operating Systems:
Operating systems that allow memory-mapped files always require files to be mapped at page boundaries. For example, with a 4-KB page size, a file can be mapped in starting at virtual address 4096, but not starting at virtual address 5000.
This statement is explained in the following way:
If a file could be mapped into the middle of a page, a single virtual page would need two partial pages on disk to map it. The first page, in particular, would be mapped onto a scratch page and also onto a file page. Handling a page fault for it would be a complex and expensive operation, requiring copying of data. Also, there would be no way to trap references to unused parts of pages. For these reasons, it is avoided.
I would like to ask for help to understand this answer. Particularly, what does it mean to say that "a single virtual page would need two partial pages on disk to map it"? From what I found about memory-mapped files, virtual pages are mapped to files on disk, and not to a paging file. Is this what is meant by "partial page"?
Also, what is meant by "scratch page" here? I've tried to look up this term on books (Tanenbaum's "Modern Operating Systems" and "Structured Computer Organization") and on the Web, but haven't found it.
First of all, when reading books and documentation always try to look critically at what you see. Sometimes authors tend to use language like "there is no other way" just to promote the solution that they are describing. Other ways are always possible.
Now to the matter. Modern operating systems always have a disk location for every allocated memory page. This makes sense: once it becomes necessary to evict a page from memory, it is already clear where to write the page if it is 'dirty', or it can simply be discarded if it has not been modified. This strategy is widely accepted, although alternative policies are also possible.
The disk location can be either the paging file or a memory-mapped file. The most common use of memory-mapped files is executables and DLLs. These are (almost) never modified: if a page of code is not used for some time, it is discarded; if control later returns to it, it is reread from the file.
In the passage you quoted, the authors say a single virtual page "would need two partial pages on disk to map it. The first page, in particular, would be mapped onto a scratch page." They present the situation as if there were only one solution. In fact, it is possible to allocate a page in the paging file for such a combined page and handle the appropriate data copying. It is also possible to have nothing in the paging file for such a page and to assemble it from the files using a transient page: in 99% of cases the disk controller can read/write only from/to a page boundary, so you would read from the first file into the memory page, read from the second file into the transient page, copy the needed data out of the transient page, and immediately discard it.
As you can see, it is perfectly possible to combine several files in one page; there is no problem in principle here, although the algorithms handling it would be more complex and consume more CPU cycles, and reconstructing such a page (if it were discarded) would require reading from several different files. These days 4 KB is a rather small quantity, and saving 2 KB is not a huge gain. In my opinion, weighing the benefits against the cost, the benefits are not significant enough.
Virtual address pages (on every machine I've ever heard of) are aligned on page sized boundaries. This is simply because it makes the math very easy. On x86, the page size is 4096. That is exactly 12 bits. To figure out which virtual page an address is referring to, you simply shift the address to the right by 12. If you were to map a disk block (assume 4096 bytes) to an address of 5000, it would start on page #1 (5000 >> 12 == 1) and end on page #2 (9095 >> 12 == 2).
Memory mapped files work by mapping a chunk of virtual address space to the file, but the data is loaded on demand (indeed, the file could be far larger than physical memory and may not fit). When you first access the virtual address and the data isn't there (i.e. it's not in physical memory), the processor will fault and the OS has to fetch the data. When you fetch the data, you need to fetch all of the data for the page, or else you wouldn't be able to turn off the fault. If you don't have the addresses aligned, then you'd have to bring in multiple disk blocks to fill the page. You can certainly do this, it's just messy and inefficient.
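On POSIX systems this page-boundary rule is visible directly in the mmap() interface: the file offset must be a multiple of the page size, otherwise the call fails with EINVAL. A minimal sketch (the file name is just a placeholder):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);   /* typically 4096 */
    int fd = open("somefile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* OK: offset 0 is page-aligned. */
    void *ok = mmap(NULL, psz, PROT_READ, MAP_PRIVATE, fd, 0);

    /* Fails with EINVAL: 5000 is not a multiple of the page size. */
    void *bad = mmap(NULL, psz, PROT_READ, MAP_PRIVATE, fd, 5000);

    printf("ok=%p bad=%p (MAP_FAILED=%p)\n", ok, bad, MAP_FAILED);
    if (ok != MAP_FAILED)
        munmap(ok, psz);
    close(fd);
    return 0;
}

Note that the address the kernel picks for the successful mapping will itself be page-aligned.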

make_request and queue limits

I'm writing a linux kernel module that emulates a block device.
There are various calls that can be used to tell the block size to the kernel, so it aligns and sizes every request toward the driver accordingly. This is well documented in the book "Linux Device Drivers, Third Edition".
The book describes two methods of implementing a block device: using a "request" function, or using a "make_request" function.
It is not clear whether the queue limit calls apply when using the minimalistic "make_request" approach (which is also the more efficient one if the underlying device really gains nothing from sequential over random IO, which is the case for me).
I would really like to get the kernel to talk to me using 4K block sizes, but I see smaller bios hitting my make_request function.
My question is: should blk_queue_limit_* affect the bio size when using make_request?
Thank you in advance.
I think I've found enough evidence in the kernel code that if you use make_request, you'll get correctly sized and aligned bios.
The answer is:
You must call blk_queue_make_request first, because it sets queue limits to defaults. After this, set queue limits as you'd like.
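A rough sketch of that ordering, for a kernel of roughly that era (~3.x) where bio-based drivers still used blk_queue_make_request(); my_make_request, my_dev_setup and the 4 KB figures are illustrative assumptions, not kernel API:

#include <linux/blkdev.h>
#include <linux/bio.h>

static struct request_queue *q;

static void my_make_request(struct request_queue *rq, struct bio *bio)
{
        /* service the bio here, then complete it */
        bio_endio(bio, 0);
}

static int my_dev_setup(void)
{
        q = blk_alloc_queue(GFP_KERNEL);
        if (!q)
                return -ENOMEM;

        /* 1. Install the make_request function first - this resets the
         *    queue limits to their defaults. */
        blk_queue_make_request(q, my_make_request);

        /* 2. Only after that, set the limits you actually want. */
        blk_queue_logical_block_size(q, 4096);
        blk_queue_physical_block_size(q, 4096);
        blk_queue_io_min(q, 4096);
        return 0;
}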
It seems that the parts of the kernel that submit bios are expected to check them for validity themselves; it's up to the submitter to do these checks. I've found only incomplete validation in submit_bio and generic_make_request, but as long as no one does anything tricky, it's fine.
Since submitting correct bios is a policy, and it's up to the submitter to take care of it with no enforcement in the middle, I think I have to implement explicit checks and fail the wrong bios. Since it's a policy, it's fine to fail on violation, and since the kernel doesn't enforce it, doing explicit checks is the right thing to do.
If you want to read a bit more on the story, see http://tlfabian.blogspot.com/2012/01/linux-block-device-drivers-queue-and.html.

How to interpret Windows Task Manager?

I run Windows 7 RC1, which uses the same WTM as Vista. When I look at the processes, there are some columns I'm not sure about the differences between:
Memory - working set
Memory - private working set
Memory - commit size
Can anyone tell me what they are?
From the following article, under the section Types of Memory Usage:
There are two main types of memory usage: working set and private working set. The private working set is the amount of memory used by a process that cannot be shared among other processes, while working set includes the memory shared by other processes.
That may sound confusing, so let's try to simplify it a bit. Let's pretend that there are two kids who are coloring, and both of the kids have 5 of their own crayons. They decide to share some of their crayons so that they have more colors to choose from. When each child is asked how many crayons they used, both of them said they used 7 crayons, because they each shared 2 of their crayons.
The point of that metaphor is that one might assume that there were a total of 14 crayons if they didn’t know that the two kids were sharing, but in reality there were only 10 crayons available. Here is the rundown:
Working Set: This includes each child's own crayons plus the ones shared with them, so the total would be 14.
Private Working Set: This includes only the crayons that each child owns, and doesn’t reflect how many were actually used in each picture. The total is therefore 10.
This is a really good comparison to how memory is measured. Many applications reuse code that you already have on your system, because in the end it helps reduce the overall memory consumption. If you are viewing the working set memory usage you might get confused because all of your running processes might actually add up to more than the amount of RAM you have installed, which is the same problem we had with the crayon metaphor above. Naturally the working set will always be larger than the private working set.
Working set:
The working set is the subset of a process's virtual pages that are resident in physical memory; this will be only part of the pages belonging to that process.
Private working set:
The private working set is the amount of memory used by a process that cannot be shared among other processes
Commit size:
Amount of virtual memory that is reserved for use by a process.
And at microsoft.com you can find more details about other memory types.
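For reference, two of these numbers can also be read programmatically via the Win32 psapi API; a small sketch (note that PrivateUsage corresponds to the commit charge, not the private working set, which requires the performance-counter interface instead):

#include <windows.h>
#include <psapi.h>   /* link with psapi.lib */
#include <stdio.h>

int main(void)
{
    PROCESS_MEMORY_COUNTERS_EX pmc;

    if (GetProcessMemoryInfo(GetCurrentProcess(),
                             (PROCESS_MEMORY_COUNTERS *)&pmc, sizeof(pmc))) {
        printf("Working set: %lu KB\n", (unsigned long)(pmc.WorkingSetSize / 1024));
        printf("Commit size: %lu KB\n", (unsigned long)(pmc.PrivateUsage / 1024));
    }
    return 0;
}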
'Working Set' is the amount of memory that the process currently has in physical RAM. In other words, accessing any pages in the 'Working Set' will not cause a page fault since the page is in RAM.
As for the other two, I'm not 100% sure, probably 'Working Set' contains sharable memory, such as memory mapped files, and 'Private Working Set' contains only pages that the process can use and are not shareable.
Have a look at this site and search for the speaker 'Dave Solomon'. There is an excellent webcast that he gave which explains Windows memory, and he mentions working set, commit sizes, and other memory terms.
EDIT:
Those site links are indeed dead :(
Instead, you can search Google for
vimeo david solomon windows
Those same videos look to be available on Vimeo now, which is cool.
If you open the Resource Monitor from the WTM, hovering over the various column headings for the process of interest displays a pretty informative tooltip.
e.g.
Commit(KB): Amount of virtual memory reserved by the operating system for the process in KB.
etc.
This article at Microsoft seems to be the most detailed.
Edit Oct 2018: new link
