Page Table and Cache Hit Rates - caching

I made a post about page tables and the number of registers needed for a multi-level page table, and found out that every page table, regardless of the number of levels, only needs one register to access the top of the page table. But my second question has not been answered.
How do the processor caches (L1-L3) affect memory references to the page table? Will the majority of them miss or hit, and why does that happen? I am told that this topic may have different answers depending on the architecture, so a general answer would be fine.
I tried to find references for this, but I could not find any. I should say that I am a real beginner in OS.
The link to my previous question:
Page Table Registers and Cache
Edit: Because of the TLB, the number of memory references to the page table can be reduced, which results in more hits. Is that correct? Help please :D

The basic idea (without any caches of any kind) is that when you access memory the CPU:
finds the highest level page table (e.g. from the virtual address and a control register) and fetches the highest level page table entry from RAM
finds the next level page table (e.g. from the virtual address and highest level page table entry) and fetches the next level page table entry from RAM; and so on (repeated for each level of page tables) until the CPU reaches the lowest level page table entry.
finds the physical address (e.g. from the virtual address and lowest level page table entry), and fetches the data from that physical address
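For illustration, here is a minimal software sketch of such a walk for a hypothetical two-level, 32-bit layout (the bit split and field names are assumptions, not any particular architecture; in real hardware each array access below is a separate fetch from RAM, and directory entries hold physical frame numbers rather than C pointers):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 32-bit layout: 10-bit directory index, 10-bit table
   index, 12-bit offset (4 KiB pages). Names are made up for the sketch. */
#define PTE_PRESENT 0x1u

typedef uint32_t pte_t;

uint32_t translate(pte_t **page_dir, uint32_t vaddr) {
    uint32_t dir_idx = (vaddr >> 22) & 0x3FFu;  /* highest-level index */
    uint32_t tbl_idx = (vaddr >> 12) & 0x3FFu;  /* next-level index    */
    uint32_t offset  =  vaddr        & 0xFFFu;

    pte_t *table = page_dir[dir_idx];           /* fetch #1: top-level entry  */
    if (table == NULL)
        return 0;                               /* not mapped */

    pte_t pte = table[tbl_idx];                 /* fetch #2: lowest-level PTE */
    if (!(pte & PTE_PRESENT))
        return 0;                               /* would raise a page fault   */

    return (pte & ~0xFFFu) | offset;            /* frame base + offset        */
}
```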
This is obviously slow. To speed it up there are multiple "cache-like things":
a) The caches themselves. E.g. rather than fetching anything from RAM the CPU may fetch from cache instead (including when the CPU fetches page table entries). Note that there are typically multiple levels of cache (e.g. L1 data cache, L2 unified cache, ...) and this may apply to some caches and not others (e.g. the CPU won't fetch page table entries from the "L1 instruction cache" but probably will fetch them from the "L3 unified cache").
b) The TLBs (Translation Look-aside Buffers); which mostly cache the lowest level page table entry. This allows almost all of the work to be skipped (if there's a "TLB hit").
c) Higher level translation caches. Modern CPUs have additional caches that cache an intermediate level of the page table hierarchy (e.g. maybe the 3rd level page table entry if there are 4 or more levels, and not the highest or lowest level entry). These reduce the cost of a "TLB miss" (if there's a "higher level translation hit") by allowing some of the work to be skipped.
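Conceptually, the TLB sits in front of the walk sketched above: on a hit the entire walk is skipped. A toy extension of that sketch (reusing translate() and pte_t from it; the tiny direct-mapped TLB and its size are invented):

```c
#define TLB_ENTRIES 16

struct tlb_entry { uint32_t vpn; uint32_t frame; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Check the TLB first; only walk the page tables on a miss. */
uint32_t translate_with_tlb(pte_t **page_dir, uint32_t vaddr) {
    uint32_t vpn = vaddr >> 12;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn)                /* TLB hit: skip the walk */
        return e->frame | (vaddr & 0xFFFu);

    uint32_t paddr = translate(page_dir, vaddr);  /* TLB miss: full walk    */
    if (paddr) {                                  /* cache the translation  */
        e->vpn = vpn;
        e->frame = paddr & ~0xFFFu;
        e->valid = 1;
    }
    return paddr;
}
```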

Related

Will page table be put in CPU cache?

According to my understanding, load/store operations access data at a virtual memory address (vaddr), and this vaddr is translated into a physical address (paddr) in order to be fulfilled by the memory hierarchy.
The translation process first looks up the TLB; if no match is found, a multi-level(?) page table lookup is then triggered.
My question is: will the page table be put in L1D cache, L2 cache or LLC, besides the quite limited TLB entries?

A cache miss, a TLB miss and a page fault

Can someone clearly explain to me the difference between a cache miss, a TLB miss and a page fault, and how these affect the effective memory access time?
Let me explain all these things step by step.
The CPU generates the logical address, which contains the page number and the page offset.
The page number is used to index into the page table to get the corresponding page frame number, and once we have the page frame of physical memory (also called main memory), we can apply the page offset to get the right word of memory.
Why the TLB (Translation Look-Aside Buffer)?
The thing is that the page table is stored in physical memory, and sometimes it can be very large, so to speed up the translation of a logical address to a physical address we sometimes use a TLB, which is made of expensive, faster associative memory. So instead of going to the page table first, we go to the TLB and use the page number to index into it to get the corresponding page frame number. If it is found, we completely avoid the page table (because we have both the page frame number and the page offset) and form the physical address.
TLB Miss
If we don't find the page frame number inside the TLB, it is called a TLB miss; only then do we go to the page table to look for the corresponding page frame number.
TLB Hit
If we find the page frame number in the TLB, it's called a TLB hit, and we don't need to go to the page table.
Page Fault
Occurs when the page accessed by a running program is not present in physical memory. It means the page is present in secondary memory but not yet loaded into a frame of physical memory.
Cache Hit
Cache memory is a small memory that operates at a faster speed than physical memory, and we always go to the cache before we go to physical memory. If we are able to locate the corresponding word in the cache, it's called a cache hit, and we don't even need to go to physical memory.
Cache Miss
Only when the cache is unable to find the corresponding block of memory (a block is similar to a physical memory page frame) do we have a cache miss; then we go to physical memory and do all that work of going through the page table or TLB.
So the flow is basically this (a code sketch follows the list):
1. First go to the cache memory; if it's a cache hit, then we are done.
2. If it's a cache miss, go to step 3.
3. Go to the TLB; if it's a TLB hit, go to physical memory using the physical address formed, and we are done.
4. If it's a TLB miss, then go to the page table to get the frame number of your page for forming the physical address.
5. If the page is not found, it's a page fault. Use one of the page replacement algorithms if all the frames are occupied by some page; otherwise just load the required page from secondary memory into a physical memory frame.
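The same flow, written out as a runnable toy (everything here is a stand-in for hardware/OS machinery, not a real API):

```c
#include <stdint.h>
#include <stdio.h>

/* Toy one-page world so the flow compiles and runs. */
enum { PAGE_SHIFT = 12, VPN = 5, FRAME = 2 };

static int cache_valid, tlb_valid, page_present;
static uint32_t memory_word = 42;                 /* the data item "X" */

static int cache_lookup(uint32_t *word) {
    if (cache_valid) { *word = memory_word; return 1; }
    return 0;
}
static int tlb_lookup(uint32_t vpn, uint32_t *frame) {
    if (tlb_valid && vpn == VPN) { *frame = FRAME; return 1; }
    return 0;
}
static int page_table_lookup(uint32_t vpn, uint32_t *frame) {
    if (page_present && vpn == VPN) { *frame = FRAME; return 1; }
    return 0;
}
static void handle_page_fault(void) { page_present = 1; } /* "load from disk" */

static uint32_t access_word(uint32_t vaddr) {
    uint32_t word, frame, vpn = vaddr >> PAGE_SHIFT;
    if (cache_lookup(&word))                      /* 1-2. cache hit: done */
        return word;
    if (!tlb_lookup(vpn, &frame)) {               /* 3. cache miss: try TLB */
        while (!page_table_lookup(vpn, &frame))   /* 4. TLB miss: walk table */
            handle_page_fault();                  /* 5. page fault           */
        tlb_valid = 1;                            /* refill the TLB          */
    }
    (void)frame;    /* a real CPU reads at (frame << PAGE_SHIFT) | offset */
    cache_valid = 1;                              /* fill the cache */
    return memory_word;
}

int main(void) {
    printf("%u\n", access_word((VPN << PAGE_SHIFT) | 0x10)); /* cold: full path */
    printf("%u\n", access_word((VPN << PAGE_SHIFT) | 0x10)); /* warm: cache hit */
    return 0;
}
```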
End Note
The flow I have discussed relates to a virtual cache (VIVT) (faster but not sharable between processes); the flow would definitely change in the case of a physical cache (PIPT) (slower but can be shared between processes). Caches can be addressed in multiple ways. If you are willing to dive deeper, have a look at this and this.
This diagram might help to see what will happen when there is a hit or a miss.
Just imagine a process is running and requires a data item X.
At first the cache will be checked to see if it has the requested data item; if it is there (cache hit), it will be returned. If it is not there (cache miss), it will be loaded from main memory.
If there is a cache miss, main memory will be checked to see if there is a page containing the requested data item (page hit), and if such a page is not there (page fault), the page containing the desired item has to be brought into main memory from disk.
While processing the page fault, the TLB will be checked to see if the desired page's frame number is available there (TLB hit); otherwise (TLB miss) the OS has to consult the page table for servicing the page fault.
Time required to access these types of memories:
cache << main memory << disk
Cache access requires the least time, so a hit or miss at a certain level drastically changes the effective access time.
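As a back-of-the-envelope illustration of "effective access time" (all latencies and hit rates below are invented round numbers, not measurements):

```c
#include <stdio.h>

int main(void) {
    /* Invented round numbers for illustration only. */
    double t_cache = 1.0;          /* ns */
    double t_mem   = 100.0;        /* ns */
    double t_disk  = 10e6;         /* ns (10 ms) */

    double p_cache_hit = 0.95;     /* assumed cache hit rate  */
    double p_fault     = 0.0001;   /* assumed page fault rate */

    /* Effective access time: hits are cheap, misses add the next
       level's latency, rare page faults add the disk latency. */
    double eat = t_cache
               + (1.0 - p_cache_hit) * t_mem
               + p_fault * t_disk;

    printf("effective access time ~ %.1f ns\n", eat);
    return 0;
}
```

With these numbers the result is about 1006 ns: even a 0.01% page-fault rate contributes a full microsecond, which is why a miss at the slowest level dominates.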
What causes page faults? Is it always because the memory has been moved to hard disk? Or just moved around for other applications?
Well, it depends. If your system does not support multiprogramming (in a multiprogramming system there are one or more programs loaded in main memory which are ready to execute), then the page fault has definitely occurred because the memory has been moved to hard disk.
If your system does support multiprogramming, then it depends on whether your operating system uses global page replacement or local page replacement. If it uses global replacement, then yes, there is a chance that the memory has been moved around for other applications. But with local replacement, the memory has been moved back to hard disk. When a process incurs a page fault, a local page replacement algorithm selects for replacement some page that belongs to that same process. On the other hand, a global replacement algorithm is free to select any page from the entire pool of frames. This distinction comes up more when dealing with thrashing.
I am confused about the difference between a TLB miss and a page fault.
A TLB miss occurs when the page table entry required for conversion of a virtual address to a physical address is not present in the TLB (translation look-aside buffer). The TLB is like a cache, but it does not store data; rather, it stores page table entries, so that we can completely bypass the page table in case of a TLB hit, as you can see in the diagram.
Is a page fault a crash? Or is it the same as a TLB miss?
Neither of them is a crash, as a crash is not recoverable, and it is well known that we can recover from both a page fault and a TLB miss without any need to abort the process execution.
The operating system uses virtual memory, and page tables map these virtual addresses to physical addresses. The TLB works as a cache for such mappings.
program >>> TLB >>> cache >>> RAM
A program searches for a page in the TLB; if it doesn't find that page, it's a TLB miss, and it then looks for the page in the cache.
If the page is not in the cache, then it's a cache miss, and it looks further for the page in RAM.
If the page is not in RAM, then it's a page fault, and the program looks for the data in secondary storage.
So, the typical flow would be:
Page Requested >> TLB miss >> cache miss >> page fault >> looks in secondary memory.

CPU cycle speed

Finding the latencies of L1/L2/L3 caches is easy:
Approximate cost to access various caches and main memory?
but I am interested in what the cost is (in CPU cycles) for translating a virtual address to a physical page address when:
There is a hit in the L1 TLB
There is a miss in the L1 TLB but a hit in the L2 TLB
There is a miss in the L2 TLB and a hit in the page table
(I don't think there can be a miss in the page table, can there? If there can be, the cost of this)
I did find this:
Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access
TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycles per access
Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
PDE cache = 32 items?
http://www.7-cpu.com/cpu/SandyBridge.html
but it doesn't mention the cost of a hit/accessing the relevant TLB cache?
Typically the L1 TLB access time will be less than the cache access time, to allow tag comparison in a set-associative, physically tagged cache. A direct-mapped cache can delay the tag check by assuming a hit. (For an in-order processor, a miss with immediate use of the data would need to wait for the miss to be handled, so there is no performance penalty. For an out-of-order processor, correcting for such wrong speculation can have a noticeable performance impact. While an out-of-order processor is unlikely to use a direct-mapped cache, it may use way prediction, which can behave similarly.) A virtually tagged cache can (in theory) delay TLB access even more, since the TLB is only needed to verify permissions, not to determine a cache hit, and the handling of permission violations is generally somewhat expensive and rare.
This means that L1 TLB access time will generally not be made public since it will not influence software performance tuning.
L2 hit time would be equivalent to the L1 miss penalty. This will vary depending on the specific implementation and may not be a single value. E.g., if the TLB uses banking to support multiple accesses in a single cycle, bank conflicts can delay accesses, or if rehashing is used to support multiple page sizes, a page of the alternate size will take longer to find (both of these cases can accumulate delay under high utilization).
The time required for an L2 TLB fill can vary greatly. ARM and x86 use hardware TLB fill using a multi-level page table. Depending on where page table data can be cached and whether there is a cache hit, the latency of a TLB fill can be between the latency of a main memory access for each level of the page table and the latency of the cache where the page table data is found for each level (plus some overhead).
Complicating this further, more recent Intel x86 processors have paging-structure caches which allow levels of the page table to be skipped. E.g., if a page directory entry (an entry in a second-level page table which points to a page of page table entries) is found in this cache, then rather than starting from the base of the page table and doing four dependent look-ups, only a single look-up is required.
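As a rough cost model of what such a paging-structure cache buys (all the numbers below are made up for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Invented numbers for illustration only. */
    int levels   = 4;    /* page-table levels on x86-64       */
    int mem_ns   = 100;  /* latency if an entry misses caches */
    int cache_ns = 10;   /* latency if an entry hits, say, L2 */

    /* Full walk: one dependent lookup per level. */
    printf("walk, all from RAM:   %d ns\n", levels * mem_ns);
    printf("walk, all from cache: %d ns\n", levels * cache_ns);

    /* Paging-structure cache hit: the upper levels are skipped,
       so only a single dependent lookup remains. */
    printf("walk after PDE hit:   %d ns (RAM) / %d ns (cache)\n",
           1 * mem_ns, 1 * cache_ns);
    return 0;
}
```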
(It might be worth noting that using a page the size of the virtual address region covered by a level of the page table (e.g., 2 MiB and 1 GiB for x86-64), reduces the depth of the page table hierarchy. Not only can using such large pages reduce TLB pressure, but it can also reduce the latency of a TLB miss.)
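To make the hierarchy depth concrete, here is a small program that splits an x86-64 virtual address into its page-table indices under the standard 4-level, 4 KiB-page layout, and shows how a 2 MiB page removes one level (the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Standard x86-64 4-level split for 4 KiB pages:
   9 bits per level + 12-bit offset = 48-bit virtual address. */
int main(void) {
    uint64_t vaddr = 0x00007f123456789AULL;

    unsigned pml4 = (vaddr >> 39) & 0x1FF;   /* level 4 index */
    unsigned pdpt = (vaddr >> 30) & 0x1FF;   /* level 3 index */
    unsigned pd   = (vaddr >> 21) & 0x1FF;   /* level 2 index */
    unsigned pt   = (vaddr >> 12) & 0x1FF;   /* level 1 index */
    unsigned off  =  vaddr        & 0xFFF;   /* 4 KiB offset  */

    printf("4 KiB page: PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);

    /* With a 2 MiB page the PD entry maps the page directly, so the
       PT level disappears and the offset grows to 21 bits. */
    printf("2 MiB page: PML4=%u PDPT=%u PD=%u offset=0x%llx\n",
           pml4, pdpt, pd, (unsigned long long)(vaddr & 0x1FFFFF));
    return 0;
}
```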
A page table miss is handled by the operating system. This might result in the page still being in memory (e.g., if the write to swap has not been completed) in which case the latency will be relatively small. (The actual latency will depend on how the operating system implements this and on cache hit behavior, though cache misses both for the code and the data are likely since paging is an uncommon event.) If the page is no longer in memory, the latency of reading from secondary storage (e.g., a disk drive) is added to the software latency of handling an invalid page table entry (i.e., a page table miss).

In an operating system, how does the MMU search for the virtual page number as a key in the page table?

1) So let's say we have a single-level page table.
2) A TLB miss happens.
3) The required page table is in main memory.
Question: Does the MMU always fetch the required page table into a number of registers inside it, so that a fast hardware search like the TLB's can be performed? I guess not; that would be costly hardware.
4) The MMU fetches the physical page number (I guess the MMU must store it in a format like: high n bits as the virtual page number and low m bits as the physical page frame number. Please correct and explain if I am wrong.)
Question: I guess there has to be a key-value map with the virtual page number as the key and the physical frame number as the value. How does the MMU search for the key in the page table? If it is a software-style linear search, it would be very costly.
5) With hardware it appends the offset bits to the page frame number, and finally a read occurs for the physical address.
So this question is bugging me a lot: how does the MMU perform the search for a given key (virtual page number) in the page table?
The use of registers for a page table is satisfactory if the page table is reasonably small (for example, 256 entries). Most contemporary computers, however, allow the page table to be very large (for example, 1 million entries). For these machines, the use of fast registers to implement the page table is not feasible. Rather, the page table is kept in main memory, and a page-table base register (PTBR) points to the page table. Changing page tables requires changing only this one register, substantially reducing context-switch time.

The problem with this approach is the time required to access a user memory location. If we want to access location i, we must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires a memory access. It provides us with the frame number, which is combined with the page offset to produce the actual address. We can then access the desired place in memory. With this scheme, two memory accesses are needed to access a byte (one for the page-table entry, one for the byte). Thus, memory access is slowed by a factor of 2. This delay would be intolerable under most circumstances. We might as well resort to swapping!

The standard solution to this problem is to use a special, small, fast-lookup hardware cache, called a translation look-aside buffer (TLB). The TLB is associative, high-speed memory. Each entry in the TLB consists of two parts: a key (or tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often numbering between 64 and 1,024.
Source: Operating System Concepts by Silberschatz et al., page 333
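In other words, for an in-memory single-level page table there is no key search at all: the virtual page number is used directly as an array index off the PTBR. A minimal sketch (the PTE layout here is invented):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define PTE_PRESENT 0x1u

typedef uint32_t pte_t;

/* The "PTBR": base address of the in-memory page table. The MMU does
   no search; it computes ptbr[vpn] directly, one memory access. */
static pte_t *ptbr;

uint32_t mmu_translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    pte_t pte = ptbr[vpn];            /* direct index, no linear search */
    if (!(pte & PTE_PRESENT))
        return 0;                     /* would raise a page fault       */

    return (pte & ~((1u << PAGE_SHIFT) - 1)) | offset;
}

int main(void) {
    static pte_t table[16];
    table[5] = (7u << PAGE_SHIFT) | PTE_PRESENT;    /* map VPN 5 -> frame 7 */
    ptbr = table;
    printf("0x%x\n", mmu_translate((5u << PAGE_SHIFT) | 0x2A)); /* 0x702A */
    return 0;
}
```

The TLB, by contrast, really is searched by key, but all keys are compared simultaneously in hardware, as the quote above says, which is exactly why it is fast, small, and expensive.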

What is TLB shootdown?

What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The actions of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
Page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware).
However, the page table can be quite large, and traversing it (to find the virtual address's corresponding physical address) can be a time-consuming process. To make this process faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently accessed virtual memory translations.
As can be clearly seen, the TLB entries need to be in sync with their respective page table entries at all times. Now, TLBs are a per-core cache, i.e. every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all of the cores. This process is called TLB shootdown.
TLB flushing can be triggered by various virtual memory operations that change the page table entries, like page migration, freeing pages, etc.
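As a user-space analogy (not kernel code: a real shootdown uses inter-processor interrupts, and each core invalidates entries with instructions such as invlpg), here is a runnable sketch in which each thread keeps a private cached translation and a shared generation counter plays the role of the shootdown:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Shared "page table": one entry mapping a virtual page to a frame. */
static atomic_uint frame = 100;        /* authoritative mapping */
static atomic_uint generation = 0;     /* bumped on every remap */

/* Each thread's private "TLB": a cached frame plus the generation it
   was cached at. A stale generation plays the role of a shot-down entry. */
struct tlb { unsigned frame, gen; int valid; };

static unsigned lookup(struct tlb *t) {
    unsigned g = atomic_load(&generation);
    if (!t->valid || t->gen != g) {     /* miss, or entry was shot down */
        t->frame = atomic_load(&frame); /* re-walk the "page table"     */
        t->gen = g;
        t->valid = 1;
    }
    return t->frame;
}

static void *reader(void *arg) {
    struct tlb t = {0, 0, 0};
    for (int i = 0; i < 5; i++)
        printf("reader %ld sees frame %u\n", (long)arg, lookup(&t));
    return NULL;
}

int main(void) {
    pthread_t r1, r2;
    pthread_create(&r1, NULL, reader, (void *)1L);
    pthread_create(&r2, NULL, reader, (void *)2L);

    /* Remap the page: update the table, then "shoot down" every cached
       copy by bumping the generation (a real kernel would send IPIs and
       each core would invalidate its own TLB entry). */
    atomic_store(&frame, 200);
    atomic_fetch_add(&generation, 1);

    pthread_join(r1, NULL);
    pthread_join(r2, NULL);
    return 0;
}
```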
