Why is the kernel concerned about issuing PHYSICALLY contiguous pages? - linux-kernel

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research &, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...

The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and less TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64 bit systems the kernel's mapping can encompass the entireity of physical memory - on 32 bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.

Actually the behavior of memory allocation you describe is common for many OS kernels and the main reason is kernel physical pages allocator. Typically, kernel has one physical pages allocator that is used for allocation of pages for both kernel space (including pages for DMA) and user space. In kernel space you need continuos memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, it's completely worthless because kernel can see the whole address space (on 32bit systems there's 4G limitation of virtual address space, so typically top 1G are dedicated to kernel and bottom 3G to user-space).
Linux kernel uses buddy algorithm for page allocation, so that allocation of bigger chunk takes fewer iterations than allocation of smaller chunk (well, smaller chunks are obtained by splitting bigger chunks). Moreover, using of one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocate pages for user space by 1 page per iteration. If user space needs N pages, you make N iterations. What happens if kernel wants some continuos memory then? How can it build big enough continuos chunk if you stole 1 page from each big chunk and gave them to user space?
[update]
Actually, kernel allocates continuos blocks of memory for user space not as frequently as you might think. Sure, it allocates them when it builds ELF image of a file, when it creates readahead when user process reads a file, it creates them for IPC operations (pipe, socket buffers) or when user passes MAP_POPULATE flag to mmap syscall. But typically kernel uses "lazy" page loading scheme. It gives continuos space of virtual memory to user-space (when user does malloc first time or does mmap), but it doesn't fill the space with physical pages. It allocates pages only when page fault occurs. The same is true when user process does fork. In this case child process will have "read-only" address space. When child modifies some data, page fault occurs and kernel replaces the page in child address space with a new one (so that parent and child have different pages now). Typically kernel allocates only one page in these cases.
Of course there's a big question of memory fragmentation. Kernel space always needs continuos memory. If kernel would allocate pages for user-space from "random" physical locations, it'd be much more hard to get big chunk of continuos memory in kernel after some time (for example after a week of system uptime). Memory would be too fragmented in this case.
To solve this problem kernel uses "readahead" scheme. When page fault occurs in an address space of some process, kernel allocates and maps more than one page (because there's possibility that process will read/write data from the next page). And of course it uses physically continuos block of memory (if possible) in this case. Just to reduce potential fragmentation.

A couple of that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.

Contiguous or Non-Contiguous Memory Allocation request from the kernel depends on your application.
E.g. of Contiguous memory allocation: If you require a DMA operation to be performed then you will be requesting the contiguous memory through kmalloc() call as DMA operation requires a memory which is also physically contiguous , as in DMA you will provide only the starting address of the memory chunk and the other device will read or write from that location.
Some of the operation do not require the contiguous memory so you can request a memory chunk through vmalloc() which gives the pointer to non contagious physical memory.
So it is entirely dependent on the application which is requesting the memory.
Please remember that it is a good practice that if you are requesting the contiguous memory than it should be need based only as kernel is trying best to allocation the memory which is physically contiguous.Well kmalloc() and vmalloc() has their limits also.

Placing things we are going to be reading a lot physically close together takes advantage of spacial locality, things we need are more likely to be cached.
Not sure about this one
I believe this means if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contigous, we can express all the pages for a processes as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a seperate index for all of the pages of a given processes. Because the TLB has a finite size and we need to access more data, this means we will be swapping in and out a lot more.

kernel does not need physically contiguous pages actually it just needs efficencies ans stabilities.
monolithic kernel tends to have one page table for kernel space shared among processes
and does not want page faults on kernel space that makes kernel designs too complex
so usual implementations on 32 bit architecture is always 3g/1g split for 4g address space
for 1g kernel space, normal mappings of code and data should not generate recursive page faults that is too complex to manage:
you need to find empty page frames, create mapping on mmu, and handle tlb flush for new mappings on every kernel side page fault
kernel is already busy of doing user side page faults
furthermore, 1:1 linear mapping could have much less page table entries because it can utilize bigger size of page unit (>4kb)
less entries leads to less tlb misses.
so buddy allocator on kernel linear address space always provides physically contiguous page frames
even most codes doesn't need contiguous frames
but many device drivers which need contiguous page frames already believe that allocated buffers through general kernel allocator are physically contiguous

Related

Is MAP_HUGETLB synonym of coherent memory? (when successful)

Am I correct to assume the mmap'd memory using MAP_HUGETLB|MAP_ANONYMOUS is actually 100% physically coherent? at least on the huge page size, 2MB or 1GB.
Otherwise I don't know how it could work/be performant since the TLB would need more entries...
Yes, they are. Indeed, as you point out, in case they weren't, multiple page table entries would be needed for a single huge page, which would defeat the entire purpose of having a huge page.
Here's an excerpt from Documentation/admin-guide/mm/hugetlbpage.rst:
The default for the allowed nodes--when the task has default memory
policy--is all on-line nodes with memory. Allowed nodes with
insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
discussion below <mem_policy_and_hp_alloc> of the interaction
of task memory policy, cpusets and per node attributes with the
allocation and freeing of persistent huge pages.
The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time
of the allocation attempt. If the kernel is unable to allocate huge
pages from some nodes in a NUMA system, it will attempt to make up the
difference by allocating extra pages on other nodes with sufficient
available contiguous memory, if any.
See also: How do I allocate a DMA buffer backed by 1GB HugePages in a linux kernel module?

virtual memory effects and relations between paging and segmentation

This is my first post. I want to ask about how are virtual memory related to paging and segmentation. I am searching internet for few days, but still can't manage to put that information into right order. Here is what I know so far:
We can talk about addresses (we could say they are levels of memory abstraction) in memory:
physical level (CPU talking to memory controller, "hey give me contents of address 0xFFEABCD", these adresses are adresses of cells in RAM, so cell 0xABCD has physical address 0xABCD. memory controller can only use physical adresses, so if adress is not physical it must be changed to physical.
logical level.This is abstraction over physical addresses. Here processes if ask for memory, (assume successfull allocation) are given address which has no direct relation to cells in RAM. We can say these addresses are from different pool (world?) than physical addresses. As I said before memory controller only understand physical adresses, so to use logical addresses, we need to convert them to physical addresses. There are two ways for OS to be able to create logical adresses:
paging - in which physical memory (RAM) is divided into continous blocks of memory (called frames), and logical memory (this other world) is also divided in same in length blocks (called pages). Now OS keep in RAM data structure called page table. It's an associative array (map) and it's primary goal of existence is to translate logical level addresses to physical level adresses. Paging has following effect: memory allocated by process in RAM (so in frames in physical memory belonging to program) may not be in contingous manner (so there may be holes inside).
segmentation - program is divided into parts called segments. Segments sizes are not fixed, so different segments may have different sizes. Program is divided in few segments and each segment will have its own place in RAM (physical) memory. So one segment (call it sementA), and another (call it segmentB) may not be near each other. In other words segmentA don't have to has segmentB as a neighbour.
internal fragmentation - when memory which belongs to process isn't used in 100%. So if process want to have 2 bytes for its use, OS need to allocate page/pages which total size need to be greater or equal than amount of memory requested by program. Typical size of page is 4KB. Unit in which OS gives memory to process are pages. So it can't give less than 4KB. So if we use 2 bytes, 4KB - 2B = 4094 bytes are wasted (memory is associated with our process so other processes can't use it. Only we can use it, but we only need 2B).
external fragmentation - when allocated blocks of memory are one near another, but there is a little hole between them. Its free, so other programs, can use it, but it is unlikly because it is very small. That holes with high probability will be wasted. More holes - more wasted memory.
Paging may cause effect of internal fragmentation. Segmentation may cause effect of external fragmentation.
virtual level - addresses used in virtual memory. This is extension of logical memory level. Now program don't even need to have all of it's allocated pages in RAM to start execution. It can be implemented with following techniques:
paged segmentation - method in which segments are divided into pages.
segmented paging - less used method but also possible.
Combining them takes a positive aspects from both solutions.
What i have read about pros and cons of virtual memory:
PROS:
processes have their own address space which mean if we have two processes A and B, and both of them have a pointer to address eg. 17 processA pointer will be showing to different frame than pointer in processB. this results in greater process isolation. Processes are protected from each other (so one process can't do things with another process memory if it isn't shared memory because in its mapping don't exist such mapping entry), and OS is more protected from processes.
have more memory than you physical first order memory(RAM, due to swapping to secondary order memory).
better use of memory due to:
swapping unused parts of programs to secondary memory.
making sharings pages possible, also make possible "copy on write".
improved multiprogram capability (when not needed parts of programs are swapped out to secondary memory, they made free space in ram which could be used for new procesess.)
improved CPU utilisation (if you can have more processes loaded into memory you have bigger probability than there exist some program that now need do CPU stuff, not IO stuff. In such cases you can better utilise CPU).
CONS:
virtual memory has it's overhead because we need to get access to memory twice (but here a lot of improvment can be achieved using TLB buffers)
it makes OS part managing memory more complicated.
So here we came to parts which I don't really understand:
Why in some sources logical address and virtual addresses are described as synonymes? Do I get something wrong?
Is really virtual memory making protection to processes? I mean, in segmentation for example there was also check if process do not acces other memory (resulting in segfault if it does), paging also has a protection bit in a page table, so doesn't the protection come from simply extending abstraction of logic level addresses? If VM (Virtual Memory) brings extended protection features, what are they and how they work? In other words: does creating separate address space for each process, bring extended memory protection. If so, what can't be achieved is paging without VM?
How really differ paged segmentation from segmented paging. I know that the difference between these two will be how a address is constructed (a page number, segment number, that stuff..), but I suppose it isn't enough to develop 2 strategies. This reason is like nothing. I read that segmented paging is less elastic, and that's the reason why it is rarely used. But why it it less elastic? Is the reason for that, that in program you can have only few segments instead a lot of pages. If thats the case paging indeed allow better "granularity".
If VM make separate address space for each process, does it mean, paging without VM use logic addresses from "one pool" (is then every logic address globally unique in that case?).
Any help on that topic would be appreciated.
Edit: #1
Ok. I finally understood that paging not on demand is also a virtual memory. I just found some clarification was helpful to understand the topic. Below is link to image which I made to visualize differences. Thanks for help.
differences between paging, demand paging and swapping
Why in some sources logical address and virtual addresses are described as synonymes? Do I get something wrong?
Many sources conflate logical and virtual memory translation. In ye olde days, logical address translation never took place without virtual address translation so processor documentation referred to them as the same.
Now we have large memory systems that use logical memory translation without virtual memory.
Is really virtual memory making protection to processes?
It is the logical memory translation that implements page protections.
How really differ paged segmentation from segmented paging.
You can really ignore segments. No rationally designed processor architecture designed after 1970 used segments and they are finally dying out.
If VM make separate address space for each process, does it mean, paging without VM use logic addresses from "one pool"
It is logical memory that creates the separate address space for each process. Paging is virtual memory. You cannot have one without the other.

What are the differences between virtual memory and physical memory?

I am often confused with the concept of virtualization in operating systems. Considering RAM as the physical memory, why do we need the virtual memory for executing a process?
Where does this virtual memory stand when the process (program) from the external hard drive is brought to the main memory (physical memory) for the execution.
Who takes care of the virtual memory and what is the size of the virtual memory?
Suppose if the size of the RAM is 4GB (i.e. 2^32-1 address spaces) what is the size of the virtual memory?
Softwares run on the OS on a very simple premise - they require memory. The device OS provides it in the form of RAM. The amount of memory required may vary - some softwares need huge memory, some require paltry memory. Most (if not all) users run multiple applications on the OS simultaneously, and given that memory is expensive (and device size is finite), the amount of memory available is always limited. So given that all softwares require a certain amount of RAM, and all of them can be made to run at the same time, OS has to take care of two things:
That the software always runs until user aborts it, i.e. it should not auto-abort because OS has run out of memory.
The above activity, while maintaining a respectable performance for the softwares running.
Now the main question boils down to how the memory is being managed. What exactly governs where in the memory will the data belonging to a given software reside?
Possible solution 1: Let individual softwares specify explicitly the memory address they will use in the device. Suppose Photoshop declares that it will always use memory addresses ranging from 0 to 1023 (imagine the memory as a linear array of bytes, so first byte is at location 0, 1024th byte is at location 1023) - i.e. occupying 1 GB memory. Similarly, VLC declares that it will occupy memory range 1244 to 1876, etc.
Advantages:
Every application is pre-assigned a memory slot, so when it is installed and executed, it just stores its data in that memory area, and everything works fine.
Disadvantages:
This does not scale. Theoretically, an app may require a huge amount of memory when it is doing something really heavy-duty. So to ensure that it never runs out of memory, the memory area allocated to it must always be more than or equal to that amount of memory. What if a software, whose maximal theoretical memory usage is 2 GB (hence requiring 2 GB memory allocation from RAM), is installed in a machine with only 1 GB memory? Should the software just abort on startup, saying that the available RAM is less than 2 GB? Or should it continue, and the moment the memory required exceeds 2 GB, just abort and bail out with the message that not enough memory is available?
It is not possible to prevent memory mangling. There are millions of softwares out there, even if each of them was allotted just 1 kB memory, the total memory required would exceed 16 GB, which is more than most devices offer. How can, then, different softwares be allotted memory slots that do not encroach upon each other's areas? Firstly, there is no centralized software market which can regulate that when a new software is being released, it must assign itself this much memory from this yet unoccupied area, and secondly, even if there were, it is not possible to do it because the no. of softwares is practically infinite (thus requiring infinite memory to accommodate all of them), and the total RAM available on any device is not sufficient to accommodate even a fraction of what is required, thus making inevitable the encroaching of the memory bounds of one software upon that of another. So what happens when Photoshop is assigned memory locations 1 to 1023 and VLC is assigned 1000 to 1676? What if Photoshop stores some data at location 1008, then VLC overwrites that with its own data, and later Photoshop accesses it thinking that it is the same data is had stored there previously? As you can imagine, bad things will happen.
So clearly, as you can see, this idea is rather naive.
Possible solution 2: Let's try another scheme - where OS will do majority of the memory management. Softwares, whenever they require any memory, will just request the OS, and the OS will accommodate accordingly. Say OS ensures that whenever a new process is requesting for memory, it will allocate the memory from the lowest byte address possible (as said earlier, RAM can be imagined as a linear array of bytes, so for a 4 GB RAM, the addresses range for a byte from 0 to 2^32-1) if the process is starting, else if it is a running process requesting the memory, it will allocate from the last memory location where that process still resides. Since the softwares will be emitting addresses without considering what the actual memory address is going to be where that data is stored, OS will have to maintain a mapping, per software, of the address emitted by the software to the actual physical address (Note: that is one of the two reasons we call this concept Virtual Memory. Softwares are not caring about the real memory address where their data are getting stored, they just spit out addresses on the fly, and the OS finds the right place to fit it and find it later if required).
Say the device has just been turned on, OS has just launched, right now there is no other process running (ignoring the OS, which is also a process!), and you decide to launch VLC. So VLC is allocated a part of the RAM from the lowest byte addresses. Good. Now while the video is running, you need to start your browser to view some webpage. Then you need to launch Notepad to scribble some text. And then Eclipse to do some coding.. Pretty soon your memory of 4 GB is all used up, and the RAM looks like this:
Problem 1: Now you cannot start any other process, for all RAM is used up. Thus programs have to be written keeping the maximum memory available in mind (practically even less will be available, as other softwares will be running parallelly as well!). In other words, you cannot run a high-memory consuming app in your ramshackle 1 GB PC.
Okay, so now you decide that you no longer need to keep Eclipse and Chrome open, you close them to free up some memory. The space occupied in RAM by those processes is reclaimed by OS, and it looks like this now:
Suppose that closing these two frees up 700 MB space - (400 + 300) MB. Now you need to launch Opera, which will take up 450 MB space. Well, you do have more than 450 MB space available in total, but...it is not contiguous, it is divided into individual chunks, none of which is big enough to fit 450 MB. So you hit upon a brilliant idea, let's move all the processes below to as much above as possible, which will leave the 700 MB empty space in one chunk at the bottom. This is called compaction. Great, except that...all the processes which are there are running. Moving them will mean moving the address of all their contents (remember, OS maintains a mapping of the memory spat out by the software to the actual memory address. Imagine software had spat out an address of 45 with data 123, and OS had stored it in location 2012 and created an entry in the map, mapping 45 to 2012. If the software is now moved in memory, what used to be at location 2012 will no longer be at 2012, but in a new location, and OS has to update the map accordingly to map 45 to the new address, so that the software can get the expected data (123) when it queries for memory location 45. As far as the software is concerned, all it knows is that address 45 contains the data 123!)! Imagine a process that is referencing a local variable i. By the time it is accessed again, its address has changed, and it won't be able to find it any more. The same will hold for all functions, objects, variables, basically everything has an address, and moving a process will mean changing the address of all of them. Which leads us to:
Problem 2: You cannot move a process. The values of all variables, functions and objects within that process have hardcoded values as
spat out by the compiler during compilation, the process depends on
them being at the same location during its lifetime, and changing them is expensive. As a result,
processes leave behind big "holes" when they exit. This is called
External Fragmentation.
Fine. Suppose somehow, by some miraculous manner, you do manage to move the processes up. Now there is 700 MB of free space at the bottom:
Opera smoothly fits in at the bottom. Now your RAM looks like this:
Good. Everything is looking fine. However, there is not much space left, and now you need to launch Chrome again, a known memory-hog! It needs lots of memory to start, and you have hardly any left...Except.. you now notice that some of the processes, which were initially occupying large space, now is not needing much space. May be you have stopped your video in VLC, hence it is still occupying some space, but not as much as it required while running a high resolution video. Similarly for Notepad and Photos. Your RAM now looks like this:
Holes, once again! Back to square one! Except, previously, the holes occurred due to processes terminating, now it is due to processes requiring less space than before! And you again have the same problem, the holes combined yield more space than required, but they are scattered around, not much of use in isolation. So you have to move those processes again, an expensive operation, and a very frequent one at that, since processes will frequently reduce in size over their lifetime.
Problem 3: Processes, over their lifetime, may reduce in size, leaving behind unused space, which if needed to be used, will require
the expensive operation of moving many processes. This is called
Internal Fragmentation.
Fine, so now, your OS does the required thing, moves processes around and start Chrome and after some time, your RAM looks like this:
Cool. Now suppose you again resume watching Avatar in VLC. Its memory requirement will shoot up! But...there is no space left for it to grow, as Notepad is snuggled at its bottom. So, again, all processes has to move below until VLC has found sufficient space!
Problem 4: If processes needs to grow, it will be a very expensive operation
Fine. Now suppose, Photos is being used to load some photos from an external hard disk. Accessing hard-disk takes you from the realm of caches and RAM to that of disk, which is slower by orders of magnitudes. Painfully, irrevocably, transcendentally slower. It is an I/O operation, which means it is not CPU bound (it is rather the exact opposite), which means it does not need to occupy RAM right now. However, it still occupies RAM stubbornly. If you want to launch Firefox in the meantime, you can't, because there is not much memory available, whereas if Photos was taken out of memory for the duration of its I/O bound activity, it would have freed lot of memory, followed by (expensive) compaction, followed by Firefox fitting in.
Problem 5: I/O bound jobs keep on occupying RAM, leading to under-utilization of RAM, which could have been used by CPU bound jobs in the meantime.
So, as we can see, we have so many problems even with the approach of virtual memory.
There are two approaches to tackle these problems - paging and segmentation. Let us discuss paging. In this approach, the virtual address space of a process is mapped to the physical memory in chunks - called pages. A typical page size is 4 kB. The mapping is maintained by something called a page table, given a virtual address, all now we have to do is find out which page the address belong to, then from the page table, find the corresponding location for that page in actual physical memory (known as frame), and given that the offset of the virtual address within the page is same for the page as well as the frame, find out the actual address by adding that offset to the address returned by the page table. For example:
On the left is the virtual address space of a process. Say the virtual address space requires 40 units of memory. If the physical address space (on the right) had 40 units of memory as well, it would have been possible to map all location from the left to a location on the right, and we would have been so happy. But as ill luck would have it, not only does the physical memory have less (24 here) memory units available, it has to be shared between multiple processes as well! Fine, let's see how we make do with it.
When the process starts, say a memory access request for location 35 is made. Here the page size is 8 (each page contains 8 locations, the entire virtual address space of 40 locations thus contains 5 pages). So this location belongs to page no. 4 (35/8). Within this page, this location has an offset of 3 (35%8). So this location can be specified by the tuple (pageIndex, offset) = (4,3). This is just the starting, so no part of the process is stored in the actual physical memory yet. So the page table, which maintains a mapping of the pages on the left to the actual pages on the right (where they are called frames) is currently empty. So OS relinquishes the CPU, lets a device driver access the disk and fetch the page no. 4 for this process (basically a memory chunk from the program on the disk whose addresses range from 32 to 39). When it arrives, OS allocates the page somewhere in the RAM, say first frame itself, and the page table for this process takes note that page 4 maps to frame 0 in the RAM. Now the data is finally there in the physical memory. OS again queries the page table for the tuple (4,3), and this time, page table says that page 4 is already mapped to frame 0 in the RAM. So OS simply goes to the 0th frame in RAM, accesses the data at offset 3 in that frame (Take a moment to understand this. The entire page, which was fetched from disk, is moved to frame. So whatever the offset of an individual memory location in a page was, it will be the same in the frame as well, since within the page/frame, the memory unit still resides at the same place relatively!), and returns the data! Because the data was not found in memory at first query itself, but rather had to be fetched from disk to be loaded into memory, it constitutes a miss.
Fine. Now suppose, a memory access for location 28 is made. It boils down to (3,4). Page table right now has only one entry, mapping page 4 to frame 0. So this is again a miss, the process relinquishes the CPU, device driver fetches the page from disk, process regains control of CPU again, and its page table is updated. Say now the page 3 is mapped to frame 1 in the RAM. So (3,4) becomes (1,4), and the data at that location in RAM is returned. Good. In this way, suppose the next memory access is for location 8, which translates to (1,0). Page 1 is not in memory yet, the same procedure is repeated, and the page is allocated at frame 2 in RAM. Now the RAM-process mapping looks like the picture above. At this point in time, the RAM, which had only 24 units of memory available, is filled up. Suppose the next memory access request for this process is from address 30. It maps to (3,6), and page table says that page 3 is in RAM, and it maps to frame 1. Yay! So the data is fetched from RAM location (1,6), and returned. This constitutes a hit, as data required can be obtained directly from RAM, thus being very fast. Similarly, the next few access requests, say for locations 11, 32, 26, 27 all are hits, i.e. data requested by the process is found directly in the RAM without needing to look elsewhere.
Now suppose a memory access request for location 3 comes. It translates to (0,3), and page table for this process, which currently has 3 entries, for pages 1, 3 and 4 says that this page is not in memory. Like previous cases, it is fetched from disk, however, unlike previous cases, RAM is filled up! So what to do now? Here lies the beauty of virtual memory, a frame from the RAM is evicted! (Various factors govern which frame is to be evicted. It may be LRU based, where the frame which was least recently accessed for a process is to be evicted. It may be first-come-first-evicted basis, where the frame which allocated longest time ago, is evicted, etc.) So some frame is evicted. Say frame 1 (just randomly choosing it). However, that frame is mapped to some page! (Currently, it is mapped by the page table to page 3 of our one and only one process). So that process has to be told this tragic news, that one frame, which unfortunate belongs to you, is to be evicted from RAM to make room for another pages. The process has to ensure that it updates its page table with this information, that is, removing the entry for that page-frame duo, so that the next time a request is made for that page, it right tells the process that this page is no longer in memory, and has to be fetched from disk. Good. So frame 1 is evicted, page 0 is brought in and placed there in the RAM, and the entry for page 3 is removed, and replaced by page 0 mapping to the same frame 1. So now our mapping looks like this (note the colour change in the second frame on the right side):
Saw what just happened? The process had to grow, it needed more space than the available RAM, but unlike our earlier scenario where every process in the RAM had to move to accommodate a growing process, here it happened by just one page replacement! This was made possible by the fact that the memory for a process no longer needs to be contiguous, it can reside at different places in chunks, OS maintains the information as to where they are, and when required, they are appropriately queried. Note: you might be thinking, huh, what if most of the times it is a miss, and the data has to be constantly loaded from disk into memory? Yes, theoretically, it is possible, but most compilers are designed in such a manner that follows locality of reference, i.e. if data from some memory location is used, the next data needed will be located somewhere very close, perhaps from the same page, the page which was just loaded into memory. As a result, the next miss will happen after quite some time, most of the upcoming memory requirements will be met by the page just brought in, or the pages already in memory which were recently used. The exact same principle allows us to evict the least recently used page as well, with the logic that what has not been used in a while, is not likely to be used in a while as well. However, it is not always so, and in exceptional cases, yes, performance may suffer. More about it later.
Solution to Problem 4: Processes can now grow easily, if space problem is faced, all it requires is to do a simple page replacement, without moving any other process.
Solution to Problem 1: A process can access unlimited memory. When more memory than available is needed, the disk is used as backup, the new data required is loaded into memory from the disk, and the least recently used data frame (or page) is moved to disk. This can go on infinitely, and since disk space is cheap and virtually unlimited, it gives an illusion of unlimited memory. Another reason for the name Virtual Memory, it gives you illusion of memory which is not really available!
Cool. Earlier we were facing a problem where even though a process reduces in size, the empty space is difficult to be reclaimed by other processes (because it would require costly compaction). Now it is easy, when a process becomes smaller in size, many of its pages are no longer used, so when other processes need more memory, a simple LRU based eviction automatically evicts those less-used pages from RAM, and replaces them with the new pages from the other processes (and of course updating the page tables of all those processes as well as the original process which now requires less space), all these without any costly compaction operation!
Solution to Problem 3: Whenever processes reduce in size, its frames in RAM will be less used, so a simple LRU based eviction can evict those pages out and replace them with pages required by new processes, thus avoiding Internal Fragmentation without need for compaction.
As for problem 2, take a moment to understand this, the scenario itself is completely removed! There is no need to move a process to accommodate a new process, because now the entire process never needs to fit at once, only certain pages of it need to fit ad hoc, that happens by evicting frames from RAM. Everything happens in units of pages, thus there is no concept of hole now, and hence no question of anything moving! May be 10 pages had to be moved because of this new requirement, there are thousands of pages which are left untouched. Whereas, earlier, all processes (every bit of them) had to be moved!
Solution to Problem 2: To accommodate a new process, data from only less recently used parts of other processes have to be evicted as required, and this happens in fixed size units called pages. Thus there is no possibility of hole or External Fragmentation with this system.
Now when the process needs to do some I/O operation, it can relinquish CPU easily! OS simply evicts all its pages from the RAM (perhaps store it in some cache) while new processes occupy the RAM in the meantime. When the I/O operation is done, OS simply restores those pages to the RAM (of course by replacing the pages from some other processes, may be from the ones which replaced the original process, or may be from some which themselves need to do I/O now, and hence can relinquish the memory!)
Solution to Problem 5: When a process is doing I/O operations, it can easily give up RAM usage, which can be utilized by other processes. This leads to proper utilization of RAM.
And of course, now no process is accessing the RAM directly. Each process is accessing a virtual memory location, which is mapped to a physical RAM address and maintained by the page-table of that process. The mapping is OS-backed, OS lets the process know which frame is empty so that a new page for a process can be fitted there. Since this memory allocation is overseen by the OS itself, it can easily ensure that no process encroaches upon the contents of another process by allocating only empty frames from RAM, or upon encroaching upon the contents of another process in the RAM, communicate to the process to update it page-table.
Solution to Original Problem: There is no possibility of a process accessing the contents of another process, since the entire allocation is managed by the OS itself, and every process runs in its own sandboxed virtual address space.
So paging (among other techniques), in conjunction with virtual memory, is what powers today's softwares running on OS-es! This frees the software developer from worrying about how much memory is available on the user's device, where to store the data, how to prevent other processes from corrupting their software's data, etc. However, it is of course, not full-proof. There are flaws:
Paging is, ultimately, giving user the illusion of infinite memory by using disk as secondary backup. Retrieving data from secondary storage to fit into memory (called page swap, and the event of not finding the desired page in RAM is called page fault) is expensive as it is an IO operation. This slows down the process. Several such page swaps happen in succession, and the process becomes painfully slow. Ever seen your software running fine and dandy, and suddenly it becomes so slow that it nearly hangs, or leaves you with no option that to restart it? Possibly too many page swaps were happening, making it slow (called thrashing).
So coming back to OP,
Why do we need the virtual memory for executing a process? - As the answer explains at length, to give softwares the illusion of the device/OS having infinite memory, so that any software, big or small, can be run, without worrying about memory allocation, or other processes corrupting its data, even when running in parallel. It is a concept, implemented in practice through various techniques, one of which, as described here, is Paging. It may also be Segmentation.
Where does this virtual memory stand when the process (program) from the external hard drive is brought to the main memory (physical memory) for the execution? - Virtual memory doesn't stand anywhere per se, it is an abstraction, always present, when the software/process/program is booted, a new page table is created for it, and it contains the mapping from the addresses spat out by that process to the actual physical address in RAM. Since the addresses spat out by the process are not real addresses, in one sense, they are, actually, what you can say, the virtual memory.
Who takes care of the virtual memory and what is the size of the virtual memory? - It is taken care of by, in tandem, the OS and the software. Imagine a function in your code (which eventually compiled and made into the executable that spawned the process) which contains a local variable - an int i. When the code executes, i gets a memory address within the stack of the function. That function is itself stored as an object somewhere else. These addresses are compiler generated (the compiler which compiled your code into the executable) - virtual addresses. When executed, i has to reside somewhere in actual physical address for duration of that function at least (unless it is a static variable!), so OS maps the compiler generated virtual address of i into an actual physical address, so that whenever, within that function, some code requires the value of i, that process can query the OS for that virtual address, and OS in turn can query the physical address for the value stored, and return it.
Suppose if the size of the RAM is 4GB (i.e. 2^32-1 address spaces) what is the size of the virtual memory? - The size of the RAM is not related to the size of virtual memory, it depends upon the OS. For example, on 32 bit Windows, it is 16 TB, on 64 bit Windows, it is 256 TB. Of course, it is also limited by the disk size, since that is where the memory is backed up.
Virtual memory is, among other things, an abstraction to give the programmer the illusion of having infinite memory available on their system.
Virtual memory mappings are made to correspond to actual physical addresses. The operating system creates and deals with these mappings - utilizing the page table, among other data structures to maintain the mappings. Virtual memory mappings are always found in the page table or some similar data structure (in case of other implementations of virtual memory, we maybe shouldn't call it the "page table"). The page table is in physical memory as well - often in kernel-reserved spaces that user programs cannot write over.
Virtual memory is typically larger than physical memory - there wouldn't be much reason for virtual memory mappings if virtual memory and physical memory were the same size.
Only the needed part of a program is resident in memory, typically - this is a topic called "paging". Virtual memory and paging are tightly related, but not the same topic. There are other implementations of virtual memory, such as segmentation.
I could be assuming wrong here, but I'd bet the things you are finding hard to wrap your head around have to do with specific implementations of virtual memory, most likely paging. There is no one way to do paging - there are many implementations and the one your textbook describes is likely not the same as the one that appears in real OSes like Linux/Windows - there are probably subtle differences.
I could blab a thousand paragraphs about paging... but I think that is better left to a different question targeting specifically that topic.
I am shamelessly copying the excerpts from man page of top
VIRT -- Virtual Image (kb)
The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been
swapped out and pages that have been mapped but not used.
SWAP -- Swapped size (kb)
Memory that is not resident but is present in a task. This is memory that has been swapped out but could include additional non-
resident memory. This column is calculated by subtracting physical memory from virtual memory
See here: Physical Vs Virtual Memory
Virtual memory is stored on the hard drive and is used when the RAM is filled. Physical memory is limited to the size of the RAM chips installed in the computer. Virtual memory is limited by the size of the hard drive, so virtual memory has the capability for more storage.
Physical memory: Physical memory refers to the RAM or the primary memory in the computer. Physical memory is a volatile memory. Therefore, it requires a continuous flow of power to retain data.
Virtual memory is a logical memory. In other words, it is a memory management technique performed by the operating system. Virtual memory allows the programmer to use more memory for the programs than the available physical memory. If the physical memory is 4GB and the virtual memory is 16GB, the programmer can use the 16GB virtual memory to execute the program. Using virtual memory, he can execute complex programs that require more memory than the physical memory.
The main difference between physical and virtual memory is that the physical memory refers to the actual RAM of the system attached to the motherboard, but the virtual memory is a memory management technique that allows the users to execute programs larger than the actual physical memory.
So, any program typically needs to store/read data from memory (physical address). CPU generates address space that is referred as Logical Address Apace, which is a set of addresses that can be used to access the program in memory, since logical address space corresponds to actual addresses there.
Addresses in logical address space are referred to as Logical or Virtual Addresses, whereas the set of addresses that correspond to them in physical memory are referred to as Physical Address Space, each of them called Physical Address, which is actual address in memory.
Note: MMU or Memory Management Unit is responsible for the address binding that takes place.
Source: "Operating System Concepts" by Abraham Silberschatz (Chapter8 Memory Management Strategies)

Does virtual address matching matter in shared mem IPC?

I'm implementing IPC between two processes on the same machine (Linux x86_64 shmget and friends), and I'm trying to maximize the throughput of the data between the processes: for example I have restricted the two processes to only run on the same CPU, so as to take advantage of hardware caching.
My question is, does it matter where in the virtual address space each process puts the shared object? For example would it be advantageous to map the object to the same location in both processes? Why or why not?
It doesn't matter as long as the OS is concerned. It would have been advantageous to use the same base address in both processes if the TLB cache wasn't flushed between context switches. The Translation Lookaside Buffer (TLB) cache is a small buffer that caches virtual to physical address translations for individual pages in order to reduce the number of expensive memory reads from the process page table. Whenever a context switch occurs, the TLB cache is flushed - you don't want processes to be able to read a small portion of the memory of other processes, just because its page table entries are still cached in the TLB.
Context switch does not occur between processes running on different cores. But then each core has its own TLB cache and its content is completely uncorrelated with the content of the TLB cache of the other core. TLB flush does not occur when switching between threads from the same process. But threads share their whole virtual address space nevertheless.
It only makes sense to attach the shared memory segment at the same virtual address if you pass around absolute pointers to areas inside it. Imagine, for example, a linked list structure in shared memory. The usual practice is to use offsets from the beginning of the block instead of aboslute pointers. But this is slower as it involves additional pointer arithmetic. That's why you might get better performance with absolute pointers, but finding a suitable place in the virtual address space of both processes might not be an easy task (at least not doing it in a portable way), even on platforms with vast VA spaces like x86-64.
I'm not an expert here, but seeing as there are no other answers I will give it a go. I don't think it will really make a difference, because the virutal address does not necessarily correspond to the physical address. Said another way, the underlying physical address the OS maps your virtual address to is not dependent on the virtual address the OS gives you.
Again, I'm not a memory master. Sorry if I am way off here.

What is the maximum addressable space of virtual memory?

Saw this questions asked many times. But couldn't find a reasonable answer. What is actually the limit of virtual memory?
Is it the maximum addressable size of CPU? For example if CPU is 32 bit the maximum is 4G?
Also some texts relates it to hard disk area. But I couldn't find it is a good explanation. Some says its the CPU generated address.
All the address we see are virtual address? For example the memory locations we see when debugging a program using GDB.
The historical reason behind the CPU generating virtual address? Some texts interchangeably use virtual address and logical address. How does it differ?
Unfortunately, the answer is "it depends". You didn't mention an operating system, but you implied linux when you mentioned GDB. I will try to be completely general in my answer.
There are basically three different "address spaces".
The first is logical address space. This is the range of a pointer. Modern (386 or better) have memory management units that allow an operating system to make your actual (physical) memory appear at arbitrary addresses. For a typical desktop machine, this is done in 4KB chunks. When a program accesses memory at some address, the CPU will lookup where what physical address corresponds to that logical address, and cache that in a TLB (translation lookaside buffer). This allows three things: first it allows an operating system to give each process as much address space as it likes (up to the entire range of a pointer - or beyond if there are APIs to allow programs to map/unmap sections of their address space). Second it allows it to isolate different programs entirely, by switching to a different memory mapping, making it impossible for one program to corrupt the memory of another program. Third, it provides developers with a debugging aid - random corrupt pointers may point to some address that hasn't been mapped at all, leading to "segmentation fault" or "invalid page fault" or whatever, terminology varies by OS.
The second address space is physical memory. It is simply your RAM - you have a finite quantity of RAM. There may also be hardware that has memory mapped I/O - devices that LOOK like RAM, but it's really some hardware device like a PCI card, or perhaps memory on a video card, etc.
The third type of address is virtual address space. If you have less physical memory (RAM) than the programs need, the operating system can simulate having more RAM by giving the program the illusion of having a large amount of RAM by only having a portion of that actually being RAM, and the rest being in a "swap file". For example, say your machine has 2MB of RAM. Say a program allocated 4MB. What would happen is the operating system would reserve 4MB of address space. The operating system will try to keep the most recently/frequently accessed pieces of that 4MB in actual RAM. Any sections that are not frequently/recently accessed are copied to the "swap file". Now if the program touches a part of that 4MB that isn't actually in memory, the CPU will generate a "page fault". THe operating system will find some physical memory that hasn't been accessed recently and "page in" that page. It might have to write the content of that memory page out to the page file before it can page in the data being accessed. THis is why it is called a swap file - typically, when it reads something in from the swap file, it probably has to write something out first, effectively swapping something in memory with something on disk.
Typical MMU (memory management unit) hardware keeps track of what addresses are accessed (i.e. read), and modified (i.e. written). Typical paging implementations will often leave the data on disk when it is paged in. This allows it to "discard" a page if it hasn't been modified, avoiding writing out the page when swapping. Typical operating systems will periodically scan the page tables and keep some kind of data structure that allows it to intelligently and quickly choose what piece of physical memory has not been modified, and over time builds up information about what parts of memory change often and what parts don't.
Typical operating systems will often gently page out pages that don't change often (gently because they don't want to generate too much disk I/O which would interfere with your actual work). This allows it to instantly discard a page when a swapping operation needs memory.
Typical operating systems will try to use all the "unused" memory space to "cache" (keep a copy of) pieces of files that are accessed. Memory is thousands of times faster than disk, so if something gets read often, having it in RAM is drastically faster. Typically, a virtual memory implementation will be coupled with this "disk cache" as a source of memory that can be quickly reclaimed for a swapping operation.
Writing an effective virtual memory manager is extremely difficult. It needs to dynamically adapt to changing needs.
Typical virtual memory implementations feel awfully slow. When a machine starts to use far more memory that it has RAM, overall performance gets really, really bad.

Resources