Does memory address translation need extra access to memory?


I've got a question about virtual memory management, more specifically, the address translation.
When an application runs, the CPU receives instructions containing virtual memory addresses, and translates them into physical addresses via the page table.
My question is, since the page table also resides in memory, does that mean the CPU has to access memory twice for a single memory-access instruction? If the answer is no, then how does this actually work? Which part did I miss?
Could anyone give me some details about this?

As usual, the answer is neither yes nor no.
In the worst case you have to walk the page table, which is indeed stored in (some kind of) memory. This is not necessarily a single lookup; it can be multiple lookups, as in a two-level table (example from Wikipedia).
However, the page table is typically accompanied by a hardware assist called the translation lookaside buffer (TLB), which is essentially a cache for the page table; the lookup process can be seen in this image. It works just as you would expect a cache to work: if a lookup succeeds you happily continue with the physical fetch, and if it fails you proceed to the aforementioned page walk and update the cache afterwards.
This hardware assist is usually implemented as a CAM (Content Addressable Memory), something that's mostly used in network processing but is also very useful here. It is a memory component that does not do the lookup based upon an address but based upon 'content', or any generic key (the keys don't have to be contiguous, incrementing numbers). In this case the key would be your virtual address, and the result of the lookup would be your physical address. As this CAM is a separate component and is very fast, you could state that as long as you hit it you don't incur any extra memory overhead for virtual -> physical address translation.
You could ask why they don't put the whole page table in a CAM. Quite simply, CAMs are both quite expensive and, more importantly, quite energy-hungry, so you don't want to make them too big (we wouldn't want a laptop that requires 1 kW to run, would we?).
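To make the hit/miss difference concrete, here is a minimal sketch of a TLB sitting in front of a two-level page table. Everything in it is an assumption chosen for illustration (the name TinyMMU, 4 KiB pages, 10-bit indexes per level, a 4-entry LRU TLB); real hardware differs in many details.

```python
from collections import OrderedDict

PAGE_BITS = 12    # assumed 4 KiB pages
INDEX_BITS = 10   # assumed 10-bit index per table level

class TinyMMU:
    """Toy model: a small TLB in front of a two-level page table."""

    def __init__(self, directory, tlb_entries=4):
        self.directory = directory    # {dir_index: {table_index: frame}}
        self.tlb = OrderedDict()      # virtual page number -> frame
        self.tlb_entries = tlb_entries
        self.memory_reads = 0         # extra reads spent on translation

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.tlb:                       # TLB hit: no page-table read
            self.tlb.move_to_end(vpn)
            frame = self.tlb[vpn]
        else:                                     # TLB miss: walk both levels
            dir_index = vpn >> INDEX_BITS
            table_index = vpn & ((1 << INDEX_BITS) - 1)
            table = self.directory[dir_index]     # read 1: directory entry
            frame = table[table_index]            # read 2: table entry
            self.memory_reads += 2
            self.tlb[vpn] = frame                 # update the cache afterwards
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)      # evict least recently used
        return (frame << PAGE_BITS) | offset

mmu = TinyMMU({0: {1: 0x2A}})
first = mmu.translate(0x1234)       # miss: two extra reads
reads_after_miss = mmu.memory_reads
second = mmu.translate(0x1238)      # same page, hit: no extra reads
reads_after_hit = mmu.memory_reads
print(hex(first), reads_after_miss, hex(second), reads_after_hit)
```

The first access pays for the walk; the second access to the same page is translated from the TLB alone, so the read counter does not move.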

Sometimes.
The MMU contains a cache of virtual to physical address mapping, called a TLB (Translation Lookaside Buffer).
If the page in question is not in the TLB (a TLB miss), then the MMU needs to load the relevant piece of the page table from main memory into that cache first, which requires additional memory accesses.
Finally, if the page cannot be found at all, a trap is issued to the CPU (a page fault), and the CPU has an opportunity to fix this, e.g. by allocating memory or loading the page from a file, swap space or similar.
The details of how this is done vary between architectures. On some, a TLB miss also requires the CPU to configure the TLB, though on most this is automatic. (The CPU would still have to flush the TLB when doing a context switch, and load a new page table for e.g. a new process.)
More info e.g. here https://www.kernel.org/doc/gorman/html/understand/understand006.html
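The three outcomes described above (hit, miss, page fault), plus the flush on a context switch, can be sketched roughly like this. The class and method names are made up for the example, and the page table is just a dictionary, so treat it as a model of the control flow rather than of any real MMU.

```python
class PageFault(Exception):
    """Raised when no mapping exists; a real OS handler would fix it up."""

class Mmu:
    def __init__(self):
        self.page_table = {}   # current process: virtual page -> frame
        self.tlb = {}          # cached subset of page_table

    def switch_to(self, page_table):
        # Context switch: load the new page table and flush stale entries.
        self.page_table = page_table
        self.tlb.clear()

    def lookup(self, vpn):
        if vpn in self.tlb:
            return self.tlb[vpn], "hit"
        if vpn not in self.page_table:
            raise PageFault(vpn)               # trap to the OS
        self.tlb[vpn] = self.page_table[vpn]   # refill from the page table
        return self.tlb[vpn], "miss"

mmu = Mmu()
mmu.switch_to({7: 100})
outcomes = [mmu.lookup(7)[1], mmu.lookup(7)[1]]   # first a miss, then a hit

mmu.switch_to({7: 200})                # a different process, same virtual page
frame, outcome = mmu.lookup(7)         # the flush forces a miss; new frame

try:
    mmu.lookup(8)
    faulted = False
except PageFault:
    faulted = True
```

Without the flush in switch_to, the second process would have been handed frame 100, which belongs to the first one.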

Related

Avoiding Translation Look aside buffer ( TLB ) pollution when mmap()

When we want to write a data item, the block containing the data is brought into the cache first and the data item is written into the cache. This can cause cache pollution. To avoid this, Intel has introduced non-temporal instructions.
If I'm going to use mmap() to write data to a file and never read it again, is it possible to avoid creating TLB entries for it? Is there any instruction similar to the non-temporal instructions available?

TLB entries are needed by the CPU to map from the virtual address to the physical address, so it is not possible to avoid them with mmap() or any similar API.
Even if it were possible to avoid storing the mapping in the TLB, every access to the mapped memory would need to reload the corresponding entries from the page tables, so the performance would be much worse.
Non-temporal accesses make sense only for stores, but the page table entries are read.

Can someone explain this diagram on Paging (virtual memory) to me?

I've been trying to understand virtual memory but when I get into the real specifics of it I just get confused. I understand (or feel like I do) the fact that virtual memory is a way for a process to "think" that it has a certain amount of memory allocated to it. The virtual address space is partitioned into equal sized pages, the physical memory is partitioned into equal sized frames, and the pages map to the frames.
But like..when does this happen? In this diagram, the CPU is running Program P. That means that a part of P was already in the physical memory, correct? (Since the cpu only has access to the physical/main memory). So what exactly is being pointed at by the CPU? I see that it's a page in the virtual memory space, so like..what exactly does this page represent? Is it an instruction? Are we moving an instruction from virtual memory to physical memory, so that more of the program is in physical memory (that hadn't been needed up until that point)? Am I way off? Can someone explain this to me?

The diagram shows the process of translating a virtual address to a physical address.
The fat arrow from Program P to CPU symbolizes the program being "fed" into the CPU. [1]
The CPU "points" to a virtual address used by an instruction to address a memory location in the program P. It is divided into two parts:
Page Table Index (p): the virtual address contains an index into the page table, which maps a page to a page frame (f). For a description of the mechanism, including multi-level paging, read this.
Offset (o): as you can see, the offset is directly added to the physical address, since paging's smallest addressable unit is a page, not a byte.
Finally, the calculated address is used to address a memory location in physical memory.
[1] "fed" means "read (pronounced like red) from secondary storage into RAM and executing the program instruction by instruction".
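The split into p and o can be written out in a few lines. This is only a sketch: the 4096-byte page size and the contents of page_table are assumptions picked for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def translate(vaddr, page_table):
    # p: page table index, o: offset within the page
    p, o = divmod(vaddr, PAGE_SIZE)
    f = page_table[p]            # page p lives in page frame f
    return f * PAGE_SIZE + o     # the offset is carried over unchanged

page_table = {0: 5, 1: 9}        # hypothetical mapping: page -> frame
physical = translate(4100, page_table)   # p = 1, o = 4
print(physical)
```

Virtual address 4100 is byte 4 of page 1, and page 1 sits in frame 9, so the result is 9 * 4096 + 4 = 36868.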

I would not bother trying to understand that diagram because it makes no sense.
It is titled "Paging" but the diagram does not show paging at all.
What you are missing is that there are two steps. First there is logical memory translation (what the diagram kinda, sorta shows).
Physical memory is arranged in an array of PAGE FRAMES of some fixed size (e.g., 1K, 4K).
Each process has a LOGICAL ADDRESS SPACE consisting of PAGES that match the page frame size.
The logical address space is defined by a PAGE TABLE managed by the operating system. The page table maps logical pages to physical page frames.
If there are two processes (X and Y), logical address Q in process X and logical address Q in process Y map to different physical page frames in most cases.
Usually there is a range of logical addresses that are assigned to the SYSTEM ADDRESS SPACE. Those logical pages map to the same physical page for all processes.
Processes only address logical pages. They have no knowledge of physical pages. The Program Counter register always refers to logical addresses. The CPU automatically translates logical pages to physical page frames. The translation is entirely transparent to the process. The operating system is the only thing that has any knowledge of physical page frames, but it only manages the page tables; the CPU does the translation.
Paging is something different but related.
When a program accesses a memory address, the CPU attempts to translate that into a physical address within a page frame. Several steps occur.
The CPU locates the page table entry for the requested page.
There may not be a page table entry at all for the page. The diagram shows a contiguous logical to physical mapping. That rarely occurs. The logical address space usually has clusters of valid pages with gaps between them. If there is no page table entry for the address, the CPU triggers an exception.
The CPU reads the page table entry to determine if it references a valid page frame.
There may be an entry for the page that has not been mapped to the logical address space (e.g., the first page is usually not mapped to trap null pointer errors). If the page has not been mapped, that triggers an exception.
The CPU checks whether the access is permitted for the current processor mode.
Read/Write/Execute protection can be set for a page and access can be restricted by mode (kernel mode, user mode, or some other mode in some processors).
If the access is not permitted, the CPU triggers an exception.
[Here is where paging comes in] The CPU checks whether the page has been mapped to a physical page frame. If not, the CPU triggers a PAGE FAULT. The OS responds by locating where the page is stored in a paging file, mapping the page table to a physical page frame, loading the data from the page file into memory, and then restarting the instruction.
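The order of those checks can be sketched as follows. The Pte fields, the exception names and the free_frames list are all invented for the illustration; real page table entries and fault handlers are architecture and OS specific.

```python
from dataclasses import dataclass
from typing import Optional

class AccessViolation(Exception):
    """Missing entry, unmapped page, or access not permitted."""

class PageFault(Exception):
    """Page is valid but not currently in a physical page frame."""

@dataclass
class Pte:
    valid: bool              # page is part of the logical address space
    writable: bool
    user: bool               # accessible from user mode
    frame: Optional[int]     # None: the page lives in the paging file

def access(page_table, page, write, user_mode):
    pte = page_table.get(page)
    if pte is None:                                   # no entry at all
        raise AccessViolation("no page table entry")
    if not pte.valid:                                 # entry exists, page unmapped
        raise AccessViolation("page not mapped")
    if (write and not pte.writable) or (user_mode and not pte.user):
        raise AccessViolation("access not permitted")
    if pte.frame is None:                             # here paging comes in
        raise PageFault(page)
    return pte.frame

def access_with_paging(page_table, page, write, user_mode, free_frames):
    try:
        return access(page_table, page, write, user_mode)
    except PageFault:
        # Stand-in for the OS loading the page from the paging file.
        page_table[page].frame = free_frames.pop()
        return access(page_table, page, write, user_mode)  # restart

table = {
    0: Pte(valid=False, writable=False, user=False, frame=None),  # null page
    1: Pte(valid=True, writable=True, user=True, frame=None),     # paged out
}
frame = access_with_paging(table, 1, write=True, user_mode=True, free_frames=[42])
```

Only the last check leads to the paging path; the earlier failures are errors the process cannot recover from by having a page loaded.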

I guess most of your confusion comes from the fact that the above diagram is a little bit misguided.
Please note that the missing IP register and some extra text on both 'Tables' are the problematic parts. The rest is OK.
I show you below the same but fixed diagram which makes more sense.
As others have already told you, the above diagram is just the translation scheme for the addresses that the CPU uses to fetch the actual instructions & operands from P's virtual address space. Did you see it? It's all about addresses and nothing else!!!
It shows you how virtual addresses are managed by the CPU (in a paged scheme) in order to reach the next instruction or operand from the real physical memory using physical addresses.
The explanation of 'Downvoter' is great for that, so no need to add anything else.

Does the address translation of paging decrease memory access performance?

When paging is enabled, some hardware is responsible for translating virtual memory addresses into physical addresses. Known translations are usually kept in some sort of cache, the translation look aside buffer (TLB).
Assuming a memory access where the address translation is cached, is it any slower than directly accessing memory without paging enabled?
I'm wondering about the overhead of that translation, even when it's cached, since the access to that cache will probably also take some (although very short) time. Or is that time planned as part of the clock cycle?
(To make it clear, my question is not about page faults or cache misses of the TLB)

Like everything in life, it depends! :-)
Let's assume, for the sake of simplicity, that (a) we're talking about data rather than instructions (b) all data memory accesses hit the level 1 cache (c) the level 1 data cache is a typical set associative cache.
Each block of the data cache must be identified with an address (less the offset). If the cache uses virtual addresses then no translation need take place and there is no overhead. If the cache uses physical addresses then the address must be translated prior to the data access adding additional latency to the request. Even for a small TLB, I don't think a high performance processor could both translate the address and then complete the cache request within the same cycle. So it's fair to assume that a physically addressed cache does indeed have overhead of address translation.
So virtually addressed caches sound like the better deal, right? Unfortunately it's a double-edged sword. The problem is that virtual memory often allows multiple virtual addresses to map to the same physical address. If in our cache there are two virtual addresses that map to a single physical address, modifying one will not be reflected in the other.
So, there is an option between these two extremes. Still assuming a set associative cache, we can use the virtual address as the index while simultaneously translating the address to a physical one. After, we use the physical address as the tag to access the data. This way we take the TLB translation off the critical path thus achieving similar performance to the virtually addressed cache. It also allows us to avoid this virtual/physical aliasing problem, although it often needs a little extra help from the operating system.
So, you can see that it can be the same or slower, depending on how the cache is configured.
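Here is a rough sketch of that middle option (virtual index, physical tag). The sizes are assumptions: 4 KiB pages, 64-byte lines and 64 sets, chosen so that the index bits fall entirely inside the page offset. The sequential code only models the logic; in hardware the TLB lookup and the set selection happen in parallel.

```python
PAGE_BITS = 12   # assumed 4 KiB pages
LINE_BITS = 6    # assumed 64-byte cache lines
SET_BITS = 6     # assumed 64 sets: index bits 6..11 lie inside the page offset

def set_index(addr):
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

def tag(paddr):
    return paddr >> (LINE_BITS + SET_BITS)

def vipt_lookup(cache, tlb, vaddr):
    index = set_index(vaddr)                 # needs only the virtual address
    frame = tlb[vaddr >> PAGE_BITS]          # translation, off the critical path
    paddr = (frame << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))
    return cache[index].get(tag(paddr))      # compare against physical tags

cache = [dict() for _ in range(1 << SET_BITS)]   # one {tag: data} dict per set
tlb = {0x10: 3, 0x20: 3}                         # two virtual pages, same frame

paddr = (3 << PAGE_BITS) | 0x040
cache[set_index(paddr)][tag(paddr)] = "data"     # line cached by physical address

via_first_alias = vipt_lookup(cache, tlb, 0x10040)
via_second_alias = vipt_lookup(cache, tlb, 0x20040)
```

With these sizes both aliases select the same set and match the same physical tag, so they see one copy of the line. If the cache is large enough that index bits extend above the page offset, aliases can land in different sets, which is where the extra help from the operating system comes in.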

Paging and TLB operating systems

I'm really stuck on this question for my OS class, I don't want someone to just give me the answer though, instead if someone could tell me how to work it out.
Example Question:
This system uses simple paging and TLB
Each memory access requires 80ns
TLB access requires 10ns
TLB hit rate is 80%.
Work out the actual speedup because of the TLB?
NOTE: I changed the memory access time and the TLB access time in the question because, as I said, I don't want the answer, just a way to work it out.

In case the virtual address translation is cached in the TLB, all we need is one lookup in the TLB that will give us a physical address, and we are done. The interesting part is if we need to do the page table walk. Think carefully about what the system has to do in case it did not find an address in the TLB (well it already had to do a TLB look-up). Memory access takes 80ns, but how many of them do you need to actually get the physical address? Pretty much every paging architecture follows the schema that page-tables are stored in memory and only the entry point, the address that points to the base of the first page table (the root) is in a register.
Once you have the amount of time a miss takes, you can calculate the speed-up by comparing it to the time needed when the TLB lookup succeeds.

On a TLB hit (80% of accesses), you need 2ns to access the TLB and 20ns to access the page in main memory, so the first part is
0.8×(2+20)
On a TLB miss (1-0.8, i.e. 20% of accesses), you have still checked the TLB, which costs 2ns. The lookup then goes to the page table, which is in main memory, so that costs 20ns. Once the page table gives the desired frame, another 20ns memory access is needed to fetch the data from main memory, so the miss part is
0.2×(2+20+20)
Combining the two:
Effective access time=0.8×(2+20)+0.2×(2+20+20)
= 26ns
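The same calculation as a small function, using the 2ns and 20ns figures from this answer. The walk_accesses parameter and the speed-up baseline (every access pays for one page table read when there is no TLB) are assumptions added for the sketch; a multi-level table would need more reads per miss.

```python
def effective_access_time(hit_rate, tlb_ns, mem_ns, walk_accesses=1):
    hit = tlb_ns + mem_ns                            # TLB lookup + data access
    miss = tlb_ns + walk_accesses * mem_ns + mem_ns  # TLB + page table + data
    return hit_rate * hit + (1 - hit_rate) * miss

eat = effective_access_time(0.8, 2, 20)

# Assumed baseline without a TLB: one page table read plus the data access.
no_tlb = 20 + 20
speedup = no_tlb / eat
print(round(eat, 3), round(speedup, 3))
```

Plugging in other access times or hit rates only changes the arguments; the structure of the hit and miss terms stays the same.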

Does virtual address matching matter in shared mem IPC?

I'm implementing IPC between two processes on the same machine (Linux x86_64 shmget and friends), and I'm trying to maximize the throughput of the data between the processes: for example I have restricted the two processes to only run on the same CPU, so as to take advantage of hardware caching.
My question is, does it matter where in the virtual address space each process puts the shared object? For example would it be advantageous to map the object to the same location in both processes? Why or why not?

It doesn't matter as far as the OS is concerned. It would have been advantageous to use the same base address in both processes if the TLB cache wasn't flushed between context switches. The Translation Lookaside Buffer (TLB) cache is a small buffer that caches virtual to physical address translations for individual pages in order to reduce the number of expensive memory reads from the process page table. Whenever a context switch occurs, the TLB cache is flushed - you don't want processes to be able to read a small portion of the memory of other processes, just because its page table entries are still cached in the TLB.
Context switch does not occur between processes running on different cores. But then each core has its own TLB cache and its content is completely uncorrelated with the content of the TLB cache of the other core. TLB flush does not occur when switching between threads from the same process. But threads share their whole virtual address space nevertheless.
It only makes sense to attach the shared memory segment at the same virtual address if you pass around absolute pointers to areas inside it. Imagine, for example, a linked list structure in shared memory. The usual practice is to use offsets from the beginning of the block instead of absolute pointers. But this is slower as it involves additional pointer arithmetic. That's why you might get better performance with absolute pointers, but finding a suitable place in the virtual address space of both processes might not be an easy task (at least not in a portable way), even on platforms with vast VA spaces like x86-64.
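The offset approach can be illustrated with a linked list laid out in a flat byte buffer. A bytearray stands in for the attached segment here, and the node layout (a 32-bit value followed by a 64-bit "next" offset, with -1 as the end marker) is an assumption made for the example.

```python
import struct

# Node layout: value (int32), offset of the next node (int64); -1 ends the list.
NODE = struct.Struct("<iq")

def build_list(block, values):
    """Lay nodes out back to back; 'next' is an offset from the block start."""
    for i, value in enumerate(values):
        next_offset = (i + 1) * NODE.size if i + 1 < len(values) else -1
        NODE.pack_into(block, i * NODE.size, value, next_offset)

def read_list(block, head_offset=0):
    """Follow the offsets; works wherever the block happens to be mapped."""
    values, offset = [], head_offset
    while offset != -1:
        value, offset = NODE.unpack_from(block, offset)
        values.append(value)
    return values

block = bytearray(64)                # stand-in for the shared segment
build_list(block, [10, 20, 30])
copy_elsewhere = bytearray(block)    # same bytes at a different address
result = read_list(copy_elsewhere)
```

Because every link is relative to the start of the block, the reader does not care where the block lives in its own address space. With absolute pointers the traversal skips the base-plus-offset step, but both processes then need the segment at the same virtual address.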

I'm not an expert here, but seeing as there are no other answers I will give it a go. I don't think it will really make a difference, because the virtual address does not necessarily correspond to the physical address. Said another way, the underlying physical address the OS maps your virtual address to is not dependent on the virtual address the OS gives you.
Again, I'm not a memory master. Sorry if I am way off here.
