Suppose there is 50 MB to allocate and there is this for loop which allocates memory in each iteration. What happens when the following for loop runs.
for(int i=0; i < 20; i++)
{
int *p = malloc(5MB);
}
I was asked this question in an interview. Can someone guide me on this and direct me to the necessary topics which I must learn in order to understand such situations.
If this is a system utilising virtual memory then the situation is obviously more complex than malloc simply failing and returning null pointers. In this case the malloc call will result in allocation of pages of memory in the virtual address space. When this memory is accessed a page fault will result, control will be given to the os memory manager, and this will map the virtual memory pages to pages of physical memory. When the physical memory available is full then the memory manager will generally handle further page faults by writing data which is currently in physical memory into disk backing (or simply discarding this data if it is already backed by a disk file), and then remapping this now available physical memory for the virtual memory page which originally resulted in the page fault. Any attempts to access virtual address pages which have previously been written out to disk backing will result in a similar process occurring.
Wikipedia contains a reasonably basic overview of this process (http://en.wikipedia.org/wiki/Paging), including some implementation details for different os's. More detailed information is available from a number of other sources, eg. the intel architectures software development manuals (http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
According to documentation for malloc (eg. http://www.cplusplus.com/reference/cstdlib/malloc/) you will get a pointer to a memory block of 5MB (noting that the argument for malloc should be the number of bytes required, not MB..) the first 10 times (maybe 9 times if there is overhead associated with the malloc'ed space which is taken out of the 50MB available), and a pointer to this space will be returned. Following this further 5MB chunks of memory will not be available, and malloc will fail, returning a null pointer.
Related
When the operating system loads Program onto the main memory , it , along with the stack and heap memory , also attaches the static data along with it. I googled about what is present in the static data which said it contained the global variables and static variables. But I am confused as both of these are already present in the text file of the program then why do we add them seperately?
The data in the executable is often referred as the data segment. The CPU doesn't interact with the hard-disk but only with RAM. The data segment must thus be loaded in RAM before the CPU can access it. The file of the executable is not really a text file. It is an executable so it has a different extension. Text files often refer to an actual file with a .txt extension.
With that said, you also asked another question not long ago (If the amount of stack memory provided to a program is fixed then why does it grow downwards in the process architecture? Or am I getting it wrong?) so I will try to give some insight for both of these in this same answer.
I don't know much about caching and low level inner CPU workings but, today mostly, the CPU doesn't even operate on RAM directly. It will load a bunch of RAM chunks into the cache and make operations on them and keep RAM-cache consistency by implementing complex mechanisms. The OS also has its role to play in RAM-cache consistency but, like I said, I am far from an expert here. Other than that, caching is mostly transparent to the OS. The CPU handles it and the OS simply provides instructions to the CPU which executes them.
Today, you have paging used by most OS and implemented on most CPU architectures. With paging, every process sees a full contiguous virtual address space. The virtual address space is accessed contiguously and the hardware MMU translates those addresses to physical ones automatically by crossing the page tables. The OS is responsible to make sure the page tables are consistent and the MMU does the rest of the job (for more info read: What is paging exactly? OSDEV). If you understand paging well, things become much clearer.
For a process, there is mostly 3 types of memory. There is the stack (often called automatic storage), the heap and the static/global data. I will attempt to give precision on all of these to give a global picture.
The stack is given a maximum size when the process begins. The OS handles that and creates the page tables and places the proper address in the stack pointer register so that stack accesses reach the proper region of physical memory. The stack is automatic storage which means that it isn't handled manually by the high level programmer. For example, in C/C++, the stack is managed by the compiler which, at the entry of a function, will create a stack frame and place offsets from the stack base pointer in the instructions. Every local variable (within a function) will be accessed with a relative negative offset from the stack base pointer. What the compiler needs to do is to create a stack frame of the proper size so that there will be enough place for all local variables of a particular function (for more info on the stack see: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?).
For the heap, the OS reserves a very big amount of virtual memory. Today, virtual memory is very big (2^48 bytes or more). The amount of heap available for each process is often only limited by the amount of physical memory available to back virtual memory allocations. For example, a process could use malloc() to allocate 4KB of memory in C. The OS will be called with a system call by the libc library which is an implementation of the C standard library. The OS will then reserve a page of the virtual memory available for the heap and change the page tables so that accessing that portion of virtual memory will translate to somewhere in RAM (probably somewhere another process wasn't already using).
The static/global data are simply placed in the executable in the data segment. The data segment is loaded in the virtual memory alongside the text segment. The text segment will thus be able to access this data often using RIP-relative addressing.
I understand that each process has their own, separate heap unlike threads (which share a common heap, which thus slows heap memory allocation down as functions like malloc need to use locks for synchronization). However, how does it get decided where, and how much, memory is given to each process, and how is it ensured that this does not conflict with the memory allocated to other processes?
I have not been able to find a definitive answer on this through searching, but if one exists, please provide a link as I would greatly appreciate it. Thank you!
In order to answer the question, you need to understand about virtual memory. In virtual memory, the memory is contiguous as to what user processes can see. The heap is given a very big about of the virtual memory which is limited only by the amount of physical RAM and swap space to back the allocations. In itself the process only sees a contiguous virtual address space. On Linux, the memory allocations are done using the buddy algorithm and the kernel keeps a page struct for every page. The page struct along with the memory map of the process in the task_struct thus allows the Linux kernel to follow what page is free and which isn't.
In linux if malloc can't allocate data chunk from preallocated page. it uses mmap() or brk() to assign new pages. I wanted to clarify a few things :
I don't understand the following statment , I thought that when I use mmap() or brk() the kernel maps me a whole page. but the allocator alocates me only what I asked for from that new page? In this case trying to access unallocated space (In the newly mapped page) will not cause a page fault? (I understand that its not recommended )
The libc allocator manages each page: slicing them into smaller blocks, assigning them to processes, freeing them, and so on. For example, if your program uses 4097 bytes total, you need to use two pages, even though in reality the allocator gives you somewhere between 4105 to 4109 bytes
How does the allocator knows the VMA bounders ?(I assume no system call used) because the VMA struct that hold that information can only get accessed from kernel mode?
The system-level memory allocation (via mmap and brk) is all page-aligned and page-sized. This means if you use malloc (or other libc API that allocates memory) some small amount of memory (say 10 bytes), you are guaranteed that all the other bytes on that page of memory are readable without triggering a page fault.
Malloc and family do their own bookkeeping within the pages returned from the OS, so the mmap'd pages used by libc also contain a bunch of malloc metadata, in addition to whatever space you've allocated.
The libc allocator knows where everything is because it invokes the brk() and mmap() calls. If it invokes mmap() it passes in a size, and the kernel returns a start address. Then the libc allocator just stores those values in its metadata.
Doug Lea's malloc implementation is a very, very well documented memory allocator implementation and its comments will shed a lot of light on how allocators work in general:
http://g.oswego.edu/dl/html/malloc.html
When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research &, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and less TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64 bit systems the kernel's mapping can encompass the entireity of physical memory - on 32 bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
Actually the behavior of memory allocation you describe is common for many OS kernels and the main reason is kernel physical pages allocator. Typically, kernel has one physical pages allocator that is used for allocation of pages for both kernel space (including pages for DMA) and user space. In kernel space you need continuos memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, it's completely worthless because kernel can see the whole address space (on 32bit systems there's 4G limitation of virtual address space, so typically top 1G are dedicated to kernel and bottom 3G to user-space).
Linux kernel uses buddy algorithm for page allocation, so that allocation of bigger chunk takes fewer iterations than allocation of smaller chunk (well, smaller chunks are obtained by splitting bigger chunks). Moreover, using of one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocate pages for user space by 1 page per iteration. If user space needs N pages, you make N iterations. What happens if kernel wants some continuos memory then? How can it build big enough continuos chunk if you stole 1 page from each big chunk and gave them to user space?
[update]
Actually, kernel allocates continuos blocks of memory for user space not as frequently as you might think. Sure, it allocates them when it builds ELF image of a file, when it creates readahead when user process reads a file, it creates them for IPC operations (pipe, socket buffers) or when user passes MAP_POPULATE flag to mmap syscall. But typically kernel uses "lazy" page loading scheme. It gives continuos space of virtual memory to user-space (when user does malloc first time or does mmap), but it doesn't fill the space with physical pages. It allocates pages only when page fault occurs. The same is true when user process does fork. In this case child process will have "read-only" address space. When child modifies some data, page fault occurs and kernel replaces the page in child address space with a new one (so that parent and child have different pages now). Typically kernel allocates only one page in these cases.
Of course there's a big question of memory fragmentation. Kernel space always needs continuos memory. If kernel would allocate pages for user-space from "random" physical locations, it'd be much more hard to get big chunk of continuos memory in kernel after some time (for example after a week of system uptime). Memory would be too fragmented in this case.
To solve this problem kernel uses "readahead" scheme. When page fault occurs in an address space of some process, kernel allocates and maps more than one page (because there's possibility that process will read/write data from the next page). And of course it uses physically continuos block of memory (if possible) in this case. Just to reduce potential fragmentation.
A couple of that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
Contiguous or Non-Contiguous Memory Allocation request from the kernel depends on your application.
E.g. of Contiguous memory allocation: If you require a DMA operation to be performed then you will be requesting the contiguous memory through kmalloc() call as DMA operation requires a memory which is also physically contiguous , as in DMA you will provide only the starting address of the memory chunk and the other device will read or write from that location.
Some of the operation do not require the contiguous memory so you can request a memory chunk through vmalloc() which gives the pointer to non contagious physical memory.
So it is entirely dependent on the application which is requesting the memory.
Please remember that it is a good practice that if you are requesting the contiguous memory than it should be need based only as kernel is trying best to allocation the memory which is physically contiguous.Well kmalloc() and vmalloc() has their limits also.
Placing things we are going to be reading a lot physically close together takes advantage of spacial locality, things we need are more likely to be cached.
Not sure about this one
I believe this means if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contigous, we can express all the pages for a processes as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a seperate index for all of the pages of a given processes. Because the TLB has a finite size and we need to access more data, this means we will be swapping in and out a lot more.
kernel does not need physically contiguous pages actually it just needs efficencies ans stabilities.
monolithic kernel tends to have one page table for kernel space shared among processes
and does not want page faults on kernel space that makes kernel designs too complex
so usual implementations on 32 bit architecture is always 3g/1g split for 4g address space
for 1g kernel space, normal mappings of code and data should not generate recursive page faults that is too complex to manage:
you need to find empty page frames, create mapping on mmu, and handle tlb flush for new mappings on every kernel side page fault
kernel is already busy of doing user side page faults
furthermore, 1:1 linear mapping could have much less page table entries because it can utilize bigger size of page unit (>4kb)
less entries leads to less tlb misses.
so buddy allocator on kernel linear address space always provides physically contiguous page frames
even most codes doesn't need contiguous frames
but many device drivers which need contiguous page frames already believe that allocated buffers through general kernel allocator are physically contiguous
Here's my question: Does calling free or delete ever release memory back to the "system". By system I mean, does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and when a request for memory allocation comes, it tries to allocate a memory block from this free list (I know, the allocator is much more complex than that but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using say sbrk or brk system calls. When a memory is free'd, that block is placed in the free list.
Now consider this scenario, on peak load, a lot of objects have been allocated on heap. Now when the load decreases, the objects are free'd. So my question is: Once the object is free'd will the allocator do some calculations to find whether it should just keep this object in the free list or depending upon the current size of the free list it may decide to give that memory back to the system i.e decrease the data segment of the process using sbrk or brk?
Documentation of glibc tells me that if the allocation request is much larger than page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what?
From what I know (correct me please), a memory allocated with malloc will never be released back to the system ever until the process ends i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is then, if I use a tool to see the memory usage of my process (I am using pmap on Linux, what do you guys use?), it should always show the memory used at peak load (as the memory is never given back to the system, except when allocated using mmap)? That is memory used by the process should never ever decrease(except the stack memory)? Is it?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.
There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.
No.
It's actually a bad strategy for a number of reasons, so it doesn't happen --except-- as you note, there can be an exception for large allocations that can be directly made in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.
It is entirely implementation dependent. On Windows VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.
I think that you have all the information you need to answer your own question. pmap shows the memory that is currenly being used by the process. So, if you call pmap before the process achieves peak memory, then no it will not show peak memory. if you call pmap just before the process exits, then it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc) but as the previous poster has stated, behavior of these is implmentation dependent.
This varies a bit from implementation to implementation.
Think of your memory as a massive long block, when you allocate to it you take a bit out of your memory (labeled '1' below):
111
If I allocate more more memory with malloc it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because two is in front of it (and memory is given as a continous block). However if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1'. I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system. e.g:
33_2222
I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for(i = 0; i < 10000; i++)
pointers[i] = malloc(1024);
for(i = 0; i < 9999; i++)
free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.
Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent
Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so and if the memory manager has handling for a small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistant ones so you can return the temporary allocations back to the OS.