Working Set Algorithm and Virtual Memory

Working Set algorithm: there are two processes, each of which has its own working set window. According to the theory, that window stores the Δ most recent pages that the process has asked for.
My problem is this: when a page must be brought into the window, do we move that page directly from disk (Disk -> Window), meaning there's no need for virtual memory; or should there be an inverted page table that stores the pages, so that we move it from there (Disk -> Inverted Page Table -> Window)?
Long question short: is the WS algorithm connected (in any way) with the Inverted Page Table?
-Thanks

It sounds like you are confused here.
Inverted Page Tables are simply a mechanism for implementing page tables (logical memory translation). For learning how virtual memory works, you can ignore inverted page tables.
If you move a page from disk to physical memory, you are using virtual memory.
So, no, the WS algorithm is not connected with Inverted Page Tables.


Page Table and Cache Hit Rates

I made a post about page tables and the number of registers needed for a multi-level page table, and found out that every page table, regardless of its level, only needs one register to access the top of the page table. But my second question has not been answered.
How will the caches (L1-L3) in the processor affect memory references to the page table? Will the majority of them miss or hit, and why? I am told that this topic may have different answers depending on the architecture used, so a general answer would be fine.
I tried to find references for this, but I could not find any. I should say that I am a real beginner in OS.
The link to my previous question:
Page Table Registers and Cache
Edit: Because of the TLB, the number of memory references to the page table can be reduced, causing more hits. Is that correct? Help please :D
The basic idea (without any caches of any kind) is that when you access memory, the CPU does the following (a rough code sketch follows the list):
finds the highest level page table (e.g. from the virtual address and a control register) and fetches the highest level page table entry from RAM
finds the next level page table (e.g. from the virtual address and highest level page table entry) and fetches the next level page table entry from RAM; and so on (repeated for each level of page tables) until the CPU reaches the lowest level page table entry.
finds the physical address (e.g. from the virtual address and lowest level page table entry), and fetches the data from that physical address
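As a toy software model of that walk (the 10/10/12-bit address split, the table sizes, and every name here are invented for illustration, not any real CPU's format; present bits, permissions, and page faults are omitted):

    #include <stdint.h>
    #include <stdio.h>

    enum { ENTRIES = 1024, PAGE_SIZE = 4096 };

    static uint32_t top_table[ENTRIES];               /* entry = which second-level table */
    static uint32_t second_tables[ENTRIES][ENTRIES];  /* entry = physical frame number */
    static uint8_t  ram[64 * PAGE_SIZE];              /* pretend physical memory */

    static uint8_t load_byte(uint32_t va)
    {
        uint32_t top   = top_table[(va >> 22) & 0x3FF];           /* memory access #1 */
        uint32_t frame = second_tables[top][(va >> 12) & 0x3FF];  /* memory access #2 */
        return ram[frame * PAGE_SIZE + (va & 0xFFF)];             /* memory access #3: the data */
    }

    int main(void)
    {
        /* Map virtual page 1 to physical frame 3, then read through the mapping. */
        top_table[0] = 0;
        second_tables[0][1] = 3;
        ram[3 * PAGE_SIZE + 42] = 0xAB;
        printf("0x%02X\n", load_byte((1u << 12) | 42));   /* prints 0xAB */
        return 0;
    }

Each line of load_byte() stands for one real memory access during the walk.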
This is obviously slow. To speed it up, there are multiple "cache-like things":
a) The caches themselves. E.g. rather than fetching anything from RAM, the CPU may fetch from cache instead (including when the CPU fetches page table entries). Note that typically there are multiple levels of caches (e.g. L1 data cache, L2 unified cache, ...) and this may apply to some caches and not others (e.g. the CPU won't fetch page table entries from "L1 instruction cache" but probably will fetch them from "L3 unified cache").
b) The TLBs (Translation Look-aside Buffers), which mostly cache the lowest level page table entries. This allows almost all of the work to be skipped (if there's a "TLB hit").
c) Higher level translation caches. Modern CPUs have additional caches that cache an intermediate level of the page table hierarchy (e.g. maybe the 3rd level page table entry if there are 4 or more levels, and not the highest or lowest level entry). These reduce the cost of a "TLB miss" (if there's a "higher level translation hit") by allowing some of the work to be skipped.
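Continuing the toy model from the sketch above (it reuses the same top_table, second_tables, ram, and PAGE_SIZE), here is a tiny direct-mapped "TLB" in front of the walk. The size and indexing policy are invented for illustration; real TLBs are typically associative:

    enum { TLB_ENTRIES = 64 };

    typedef struct { uint32_t vpn, frame; int valid; } tlb_entry;
    static tlb_entry tlb[TLB_ENTRIES];

    static uint8_t load_byte_with_tlb(uint32_t va)
    {
        uint32_t vpn = va >> 12, offset = va & 0xFFF;
        tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (e->valid && e->vpn == vpn)                     /* TLB hit: skip the walk entirely */
            return ram[e->frame * PAGE_SIZE + offset];

        /* TLB miss: do the full two-level walk, then remember the translation. */
        uint32_t top   = top_table[(va >> 22) & 0x3FF];
        uint32_t frame = second_tables[top][vpn & 0x3FF];
        *e = (tlb_entry){ .vpn = vpn, .frame = frame, .valid = 1 };
        return ram[frame * PAGE_SIZE + offset];
    }

On a hit, the two page-table accesses disappear and only the data access remains, which is the whole point of the TLB.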

Does mmap directly access the page cache, or a copy of the page cache?

To ask the question another way, can you confirm that when you mmap() a file that you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found performance would get terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead just did a single mmap that was passed into the thread (such that all threads just directly access the same memory region), then performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
My guess is it doesn't actually define them all up front, which is why mmap() is instant to call even for a giant file. But it probably has to establish those entries as faults occur on the mmapped range, so over time the page directory gets filled out. This extra work to populate the page directory is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range -- which is why munmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.
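For what it's worth, here is a minimal sketch of the "map once, hand the pointer to every thread" approach that worked well for me. The file name, thread count, and the per-thread work are placeholders, and error handling is mostly omitted:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 8   /* placeholder; the real test used far more threads */

    struct slice { const unsigned char *base; size_t len; };

    /* Each thread touches its own slice of the single shared mapping, so all
     * threads reuse the same virtual range and its page-table entries. */
    static void *worker(void *arg)
    {
        struct slice *s = arg;
        unsigned long sum = 0;
        for (size_t i = 0; i < s->len; i++)
            sum += s->base[i];
        return (void *)sum;
    }

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);          /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        /* One mmap for the whole process; the pointer is handed to every thread. */
        unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        pthread_t tid[NTHREADS];
        struct slice slices[NTHREADS];
        size_t chunk = st.st_size / NTHREADS;         /* remainder bytes ignored for brevity */
        for (int i = 0; i < NTHREADS; i++) {
            slices[i] = (struct slice){ base + i * chunk, chunk };
            pthread_create(&tid[i], NULL, worker, &slices[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

The per-thread-mmap variant would simply move the open/mmap pair into worker(), which is exactly the version that performed badly for me.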

What's the relation between physical pages and pages in the paging file?

Under Windows, the kernel can swap a physical memory page to a page in the paging file.
For simplicity, we assume there is only one paging file.
As far as I understand, the paging file consists of pages which have the same size as a physical memory page, i.e. 4K.
I just wonder:
How does the kernel know which page in the paging file is free to store?
(Free here means the page in the paging file doesn't previously store another physical memory page.)
At the risk of gross oversimplification . . . the usual approach in implementing virtual memory is that disk is the primary storage. Unless there is a mapping to a file, a virtual page does not exist. That mapping remains fixed for the life of the process.
The virtual memory on disk gets mapped to physical memory when available.
The kernel maintains some data structure (e.g. a bitmap) to indicate the free areas of the page file and other structures to maintain the mapping of logical addresses to the files.
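As a toy illustration of the "bitmap of free paging-file slots" idea (the structure, sizes, and names are invented; this is not how Windows actually stores it):

    #include <stdint.h>

    /* One bit per 4K slot in the paging file; bit set = slot in use. */
    #define PAGEFILE_SLOTS (256 * 1024)      /* e.g. a 1 GB paging file of 4K slots */

    static uint8_t slot_bitmap[PAGEFILE_SLOTS / 8];

    /* Find a free slot (bit == 0), mark it used, and return its index;
     * returns -1 if the paging file is full. */
    static long alloc_pagefile_slot(void)
    {
        for (long i = 0; i < PAGEFILE_SLOTS; i++) {
            if (!(slot_bitmap[i / 8] & (1u << (i % 8)))) {
                slot_bitmap[i / 8] |= 1u << (i % 8);
                return i;
            }
        }
        return -1;
    }

    /* Called when the page is brought back in (or discarded): the slot is free again. */
    static void free_pagefile_slot(long slot)
    {
        slot_bitmap[slot / 8] &= ~(1u << (slot % 8));
    }

When a page is written out, the kernel records the returned slot index in its bookkeeping for that virtual page so it knows where to read it back from later.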
I believe you are asking about page replacement algorithms within memory management.
When the operating system needs to bring a new page into memory, there's no guarantee that there will be a free frame -- other pages might have taken them all. In that case, the OS will have to evict an existing page. The OS doesn't need free space since, if there isn't any, it will make some.
If you're interested in learning more (this is a pretty large topic), you may find the lecture notes from NYU's "Operating Systems" class helpful. This is the demand paging unit, and further below you can read about a few page replacement algorithms ("WS Clock" and "Aging" are probably the most important).
Hopefully this is helpful!

In an operating system, how does the MMU search for the virtual page number as a key in the page table?

1) So let's say there is a single-level page table.
2) A TLB miss happens.
3) The required page table is in main memory.
Question: Does the MMU always fetch the required page table into a number of registers inside it, so that a fast hardware search like the TLB's can be performed? I guess not; that would be costly hardware.
4) The MMU fetches the physical page number. (I guess the MMU must have it saved in a format like the high n bits as the virtual page number and the low m bits as the physical page frame number. Please correct and explain if I am wrong.)
Question: I guess there has to be a key-value map with the virtual page number as the key and the physical frame number as the value. How does the MMU search for the key in the page table? If it is something like a software linear search, it would be very costly.
5) In hardware it appends the offset bits to the page frame number, and finally a read occurs at the physical address.
So this question is bugging me a lot: how does the MMU perform the search for a given key (virtual page number) in the page table?
The use of registers for a page table is satisfactory if the page table is reasonably small (for example, 256 entries). Most contemporary computers, however, allow the page table to be very large (for example, 1 million entries). For these machines, the use of fast registers to implement the page table is not feasible. Rather, the page table is kept in main memory, and a page table base register (PTBR) points to the page table. Changing page tables requires changing only this one register, substantially reducing context-switch time.

The problem with this approach is the time required to access a user memory location. If we want to access location i, we must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires a memory access. It provides us with the frame number, which is combined with the page offset to produce the actual address. We can then access the desired place in memory. With this scheme, two memory accesses are needed to access a byte (one for the page-table entry, one for the byte). Thus, memory access is slowed by a factor of 2. This delay would be intolerable under most circumstances. We might as well resort to swapping!

The standard solution to this problem is to use a special, small, fast-lookup hardware cache, called a translation look-aside buffer (TLB). The TLB is associative, high-speed memory. Each entry in the TLB consists of two parts: a key (or tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often numbering between 64 and 1,024.

Source: Operating System Concepts by Silberschatz et al., page 333
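In other words, for an ordinary (non-inverted) page table there is no key search at all: the virtual page number is used directly as an array index into the table. A rough sketch (the field widths and table size are invented for illustration):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define NUM_PAGES  (1u << 20)        /* e.g. a 32-bit address space of 4K pages */

    static uint32_t page_table[NUM_PAGES];   /* entry i holds the frame for virtual page i */

    /* No lookup loop, no key comparison: the VPN *is* the position in the table. */
    static uint32_t translate(uint32_t va)
    {
        uint32_t vpn   = va >> PAGE_SHIFT;
        uint32_t frame = page_table[vpn];            /* one indexed memory access */
        return (frame << PAGE_SHIFT) | (va & 0xFFF);
    }

Only an inverted page table (typically searched via a hash) or the TLB's associative compare involves anything resembling a key search; a regular page table is just a (possibly multi-level) array indexed by the page number.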

What is Working Set?

I'm confused about the concept of Working Set while reading the Memory Management code of the Windows Research Kernel.
The "working set" is short hand for "parts of memory that the current algorithm is using" and is determined by which parts of memory the CPU just happens to access. It is totally automatic to you. If you are processing an array and storing the results in a table, the array and the table are your working set.
This is discussed because the CPU will automatically store accessed memory in cache, close to the processor. The working set is a nice way to describe the memory you want stored. If it is small enough, it can all fit in the cache and your algorithm will run very fast. On the OS level, the kernel has to tell the CPU where to find the physical memory your application is using (resolving virtual addresses) every time you access a new page (typically 4K in size), so you also want to avoid that hit as much as possible.
See What Every Programmer Should Know About Memory - PDF for graphs of algorithm performance vs size of working set (around page 23) and lots of other interesting info.
Basically - write your code to access the smallest amount of memory possible (i.e. classes are small, and there aren't too many of them), and try to ensure that tight loops run on a very, very small subset of that memory.
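As a small, made-up illustration of that advice (not taken from the linked paper): both loops below touch exactly the same bytes, but the first walks memory sequentially, so each cache line and page is fully used before moving on, while the second jumps a whole row ahead on every access and so keeps a much larger working set "hot" at any moment:

    #include <stdio.h>
    #include <time.h>

    enum { N = 2048 };
    static double a[N][N];   /* ~32 MB; shrink N if that's too large for your machine */

    /* Row-major walk: consecutive accesses hit consecutive addresses. */
    double sum_rows(void)
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major walk over the same data: every access lands a full row
     * (16 KB here) away from the previous one, so cache lines and TLB entries
     * are constantly evicted and refetched. */
    double sum_cols(void)
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        clock_t t0 = clock();
        double r = sum_rows();
        clock_t t1 = clock();
        double c = sum_cols();
        clock_t t2 = clock();
        printf("row-major: %.2fs  column-major: %.2fs  (sums: %g %g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, r, c);
        return 0;
    }

The exact timings depend on your cache sizes, which is precisely what the graphs in the linked paper are about.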
The "working set" is an informal term meaning the memory that's being accessed "frequently" (for some definition of frequently) by an application or set of applications. Applications may also allocate memory that they access infrequently (no more than once every few dozen seconds, perhaps not even once an hour); this would be outside of the working set.
An example might be if you have two Firefox windows, a minimized one that you haven't looked at for several hours, and an open one that you're browsing in right now. The memory used to store the data associated with the open window is going to be in the working set; the memory used to store the data associated with the window that's not open and that you haven't looked at for several hours is not in the working set.
This is mainly used in discussions about whether you have enough RAM in your system. If your working set is smaller than your RAM, you can work comfortably, because the data your program or programs frequently access is always in memory. If your working set is larger than your RAM, the operating system will be constantly swapping pages out to disk to make room to swap in pages that an application wants to access; these swapped out pages, being in the working set, will almost immediately be needed again, meaning that you've got to take other pages and write them out to disk, and it just goes on and on like this. This is referred to as "thrashing."
If you're not reading or writing many files, your disk light is on all the time, and your system feels very slow, that's a pretty good sign that you're thrashing.
Roughly, the working set is the areas of memory in active use. http://en.wikipedia.org/wiki/Working_set
All the memory your program is using that is not in the "working set" is marked for swap to disk. When the operating system needs more memory for other work it will try to keep the working set of each program in memory, but everything else is up for grabs.
The working set is the set of pages that are physically in memory at any one time. Although the working set is quoted and displayed in kilobytes, the smallest working set you can have is 4K (8K on Itanium), as that's the size of a page in Windows.
To see the working set of a process look at the "Mem Usage" column in the task manager's "Processes" tab.
If you're running a .NET app you can watch the working set being reduced by looking at the process in the task manager's Processes tab and then minimizing the application. Its working set is dramatically reduced as Windows swaps it out to the page file (since the process is assumed to not be "working" as much).
The term "Working Set" is related to the Working Set Page Replacement Algorithm (the algorithm is very well explained in this article of Andrew Tanenbaum).
So the working set is the set of pages a process demands to have loaded in memory during execution. The working set consists of the most recently loaded page and the k pages that were loaded before it. When it comes to making a frame free for a newly demanded page, only pages that are not in the working set may be swapped out.
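A minimal sketch of that eviction rule (the data structures, the window size K, and the "virtual time" counter are invented for illustration; see Tanenbaum's article for the real algorithm):

    #include <stdint.h>

    /* Toy working-set replacement: a page may be evicted only if it has not
     * been referenced within the last K references made by the process. */
    enum { NUM_FRAMES = 64, K = 16 };

    struct frame {
        int      in_use;
        uint32_t vpn;        /* which virtual page currently occupies this frame */
        uint64_t last_ref;   /* "virtual time" (reference count) of the last access */
    };

    static struct frame frames[NUM_FRAMES];
    static uint64_t now;     /* incremented on every memory reference */

    void note_reference(int frame_idx) { frames[frame_idx].last_ref = ++now; }

    /* Pick a victim frame for a newly demanded page: prefer a free frame,
     * otherwise a frame whose page has fallen out of the working-set window. */
    int pick_victim(void)
    {
        for (int i = 0; i < NUM_FRAMES; i++)
            if (!frames[i].in_use)
                return i;                 /* free frame, no eviction needed */

        for (int i = 0; i < NUM_FRAMES; i++)
            if (now - frames[i].last_ref > K)
                return i;                 /* outside the working set: evictable */

        return -1;   /* every resident page is in the working set (it exceeds RAM) */
    }

When pick_victim() returns -1 for every process, the working sets together exceed physical memory, which is exactly the thrashing situation described earlier.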
The working set is a subset of virtual pages resident in physical memory.
