What is the major difference between kmalloc and vmalloc? [duplicate] - linux-kernel

Possible Duplicate:
What is the difference between vmalloc and kmalloc?
Please explain in detail.

kmalloc allocates physically contiguous memory: memory whose pages are laid out consecutively in physical RAM. vmalloc allocates memory that is contiguous only in the kernel's virtual address space (pages allocated that way need not be contiguous in RAM, but the kernel sees them as one block).
kmalloc is the preferred way, as long as you don't need very big areas. The trouble is that if you want to do DMA from/to some hardware device, you'll need to use kmalloc, and you'll probably need a bigger chunk. The solution is to allocate memory as early as possible, before physical memory gets fragmented.
If you only allocate small chunks (a page or a few pages), just use kmalloc and don't worry about the details. :)
The above answer was copied from http://kerneltrap.org/node/4020
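
To make the distinction concrete, here is a minimal kernel-module sketch using both allocators (hypothetical module, arbitrary sizes; a sketch, not a complete driver):

    #include <linux/module.h>
    #include <linux/slab.h>      /* kmalloc, kfree */
    #include <linux/vmalloc.h>   /* vmalloc, vfree */

    static void *small_buf;      /* physically contiguous */
    static void *large_buf;      /* contiguous only in kernel virtual space */

    static int __init alloc_demo_init(void)
    {
        /* kmalloc: small, physically contiguous; what you'd use for DMA */
        small_buf = kmalloc(4096, GFP_KERNEL);
        if (!small_buf)
            return -ENOMEM;

        /* vmalloc: large allocation; backing pages may be scattered in RAM */
        large_buf = vmalloc(4 * 1024 * 1024);
        if (!large_buf) {
            kfree(small_buf);
            return -ENOMEM;
        }
        return 0;
    }

    static void __exit alloc_demo_exit(void)
    {
        vfree(large_buf);
        kfree(small_buf);
    }

    module_init(alloc_demo_init);
    module_exit(alloc_demo_exit);
    MODULE_LICENSE("GPL");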

kmalloc returns physically contiguous memory. kmalloc'd memory is reserved and locked; it cannot be swapped out. Physical memory is subject to fragmentation, so if you don't need a physically contiguous mapping in kernel space, you can use vmalloc to avoid the fragmentation problem.

Related

How is the heap divided up among processes?

I understand that each process has its own, separate heap, unlike threads (which share a common heap, which thus slows heap allocation down, since functions like malloc need to use locks for synchronization). However, how is it decided where, and how much, memory is given to each process, and how is it ensured that this does not conflict with the memory allocated to other processes?
I have not been able to find a definitive answer on this through searching, but if one exists, please provide a link as I would greatly appreciate it. Thank you!
To answer the question, you need to understand virtual memory. With virtual memory, the memory a user process sees is contiguous. The heap is given a very big region of the virtual address space, limited only by the amount of physical RAM and swap space available to back the allocations; the process itself only sees a contiguous virtual address space. On Linux, physical page allocations are done using the buddy algorithm, and the kernel keeps a page struct for every physical page. The page structs, together with the memory map of each process in its task_struct, let the Linux kernel track which pages are free and which are not.
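
To see the per-process layout for yourself, here is a small Linux-only sketch: it prints the process's own memory map from /proc/self/maps, where the [heap] line shows the virtual range reserved for brk-based allocations. Run two copies and each reports its own, independent heap (the 1024-byte allocation is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char line[512];
        void *p = malloc(1024);      /* ensure the heap exists */
        printf("malloc returned %p\n", p);

        FILE *maps = fopen("/proc/self/maps", "r");
        if (!maps)
            return 1;
        while (fgets(line, sizeof line, maps))
            fputs(line, stdout);     /* look for the [heap] entry */
        fclose(maps);
        free(p);
        return 0;
    }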

Paging and non-contiguous memory allocation

I have few doubts regarding the memory management in a x86_64 Linux Operating System.
If I allocate an array of 2000 bytes (statically - arr[2000]; or dynamically - malloc(2000);) from my user-space code, will those 2000 bytes be contiguous in physical memory?
If I allocate memory (same example as above: statically - arr[2000]; or dynamically - malloc(2000);), will there be a page table update to map these 2000 bytes in physical memory, so that future references to these addresses can be resolved from the page table entries?
1) Very unlikely. It's possible that "your" malloc() might seem to produce that result, but you cannot rely on it.
If you need two adjacent regions, what you would want to do is malloc(4000) and then keep two pointers: one to the start of the allocation and the other at pointer1 + 2000.
Be careful that when you free(pointer1) you also nullify pointer2.
2) Not until you reference a byte within the area.
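
Point 2 ("not until you reference a byte") can be observed with mincore(), which reports whether pages are resident in RAM. The sketch below uses mmap rather than malloc only because mincore() needs a page-aligned address; it is Linux-specific:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        unsigned char vec;
        char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        mincore(p, psz, &vec);
        printf("before touch: resident=%d\n", vec & 1);  /* typically 0 */

        p[0] = 'x';                  /* first write faults the page in */

        mincore(p, psz, &vec);
        printf("after touch:  resident=%d\n", vec & 1);  /* now 1 */

        munmap(p, psz);
        return 0;
    }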

Linux page allocation

In Linux, if malloc can't allocate a data chunk from a preallocated page, it uses mmap() or brk() to obtain new pages. I wanted to clarify a few things:
I don't understand the following statement. I thought that when I use mmap() or brk() the kernel maps me a whole page, but the allocator allocates me only what I asked for from that new page. In that case, will trying to access unallocated space (in the newly mapped page) not cause a page fault? (I understand that it's not recommended.)
The libc allocator manages each page: slicing them into smaller blocks, assigning them to processes, freeing them, and so on. For example, if your program uses 4097 bytes total, you need to use two pages, even though in reality the allocator gives you somewhere between 4105 and 4109 bytes.
How does the allocator know the VMA boundaries? (I assume no system call is used.) The VMA struct that holds that information can only be accessed from kernel mode, right?
The system-level memory allocation (via mmap and brk) is all page-aligned and page-sized. This means that if you use malloc (or another libc API that allocates memory) to get some small amount of memory (say 10 bytes), you are guaranteed that all the other bytes on that page of memory are readable without triggering a page fault.
Malloc and family do their own bookkeeping within the pages returned from the OS, so the mmap'd pages used by libc also contain a bunch of malloc metadata, in addition to whatever space you've allocated.
The libc allocator knows where everything is because it invokes the brk() and mmap() calls. If it invokes mmap() it passes in a size, and the kernel returns a start address. Then the libc allocator just stores those values in its metadata.
Doug Lea's malloc implementation is a very, very well documented memory allocator implementation and its comments will shed a lot of light on how allocators work in general:
http://g.oswego.edu/dl/html/malloc.html
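
A tiny glibc-specific sketch of the "somewhere between 4105 and 4109 bytes" point: malloc_usable_size() (a glibc extension) reveals how much the allocator really handed back; the exact figure varies with allocator version:

    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc.h>   /* malloc_usable_size (glibc extension) */

    int main(void)
    {
        void *p = malloc(4097);   /* just over one 4 KiB page */
        printf("requested 4097, usable %zu bytes\n", malloc_usable_size(p));
        free(p);
        return 0;
    }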

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research and, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! It will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space in the page tables and fewer TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested using the vmalloc() interface rather than kmalloc().
On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
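
As an illustration of that linear mapping, here is a hedged kernel-module sketch: for a kmalloc'd address, virt_to_phys() is a constant-offset translation, with no page-table walk; it would be invalid for vmalloc memory. (Assumes a recent kernel with the %px printk format.)

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/io.h>     /* virt_to_phys */

    static int __init linmap_demo_init(void)
    {
        void *p = kmalloc(128, GFP_KERNEL);
        if (!p)
            return -ENOMEM;

        /* Linear-map addresses translate by a fixed offset. */
        pr_info("virt=%px phys=%llx\n",
                p, (unsigned long long)virt_to_phys(p));

        kfree(p);
        return 0;
    }

    static void __exit linmap_demo_exit(void)
    {
    }

    module_init(linmap_demo_init);
    module_exit(linmap_demo_exit);
    MODULE_LICENSE("GPL");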
Actually, the behavior of memory allocation you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically, the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, such remapping would be completely pointless, because the kernel can see the whole address space (on 32-bit systems there's a 4 GB limit on virtual address space, so typically the top 1 GB is dedicated to the kernel and the bottom 3 GB to user space).
The Linux kernel uses the buddy algorithm for page allocation, so that allocating a bigger chunk takes fewer iterations than allocating smaller chunks (smaller chunks are obtained by splitting bigger ones). Moreover, using one allocator for both kernel space and user space lets the kernel reduce fragmentation. Imagine that you allocated pages for user space one page per iteration: if user space needs N pages, you make N iterations. But what happens if the kernel then wants some contiguous memory? How can it build a big enough contiguous chunk if you stole one page from each big chunk and gave them to user space?
[update]
Actually, the kernel allocates contiguous blocks of memory for user space less frequently than you might think. Sure, it allocates them when it loads the ELF image of a file, when it does readahead as a user process reads a file, for IPC operations (pipes, socket buffers), or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page-loading scheme. It hands a contiguous range of virtual memory to user space (when the user first calls malloc or mmap), but it doesn't fill that range with physical pages; it allocates pages only when a page fault occurs. The same is true when a user process forks. In this case the child process will have a "read-only" address space; when the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
Of course there's a big question of memory fragmentation. Kernel space always needs contiguous memory. If the kernel allocated pages for user space from "random" physical locations, it would be much harder to get a big chunk of contiguous memory in the kernel after some time (for example, after a week of system uptime). Memory would be too fragmented in that case.
To solve this problem the kernel uses a "readahead" scheme: when a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there's a good chance the process will read/write data from the next page), and of course it uses a physically contiguous block of memory (if possible) in this case, just to reduce potential fragmentation.
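
The lazy scheme and the MAP_POPULATE flag mentioned above can be observed from user space. A Linux-specific sketch (the 1 MiB size is arbitrary): the plain mapping starts with no resident pages, while the MAP_POPULATE mapping is faulted in up front:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Count resident pages in a mapping via mincore(). */
    static long resident_pages(char *p, size_t len)
    {
        long psz = sysconf(_SC_PAGESIZE);
        unsigned char vec[256];          /* enough for 1 MiB of 4 KiB pages */
        long n = 0;

        if (mincore(p, len, vec) != 0)
            return -1;
        for (size_t i = 0; i < len / psz; i++)
            n += vec[i] & 1;
        return n;
    }

    int main(void)
    {
        size_t len = 1 << 20;            /* 1 MiB */

        char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (lazy == MAP_FAILED || eager == MAP_FAILED)
            return 1;

        printf("lazy:  %ld resident pages\n", resident_pages(lazy, len));
        printf("eager: %ld resident pages\n", resident_pages(eager, len));
        return 0;
    }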
A couple of reasons that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
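
The large-pages point is also visible from user space: a huge-page mapping succeeds only if the kernel can back it with one physically contiguous block. This sketch assumes x86-64 2 MiB huge pages and that some have been reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;    /* one 2 MiB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)"); /* often ENOMEM: no huge pages free */
            return 1;
        }
        printf("got a huge page at %p\n", p);
        munmap(p, len);
        return 0;
    }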
Whether you request contiguous or non-contiguous memory from the kernel depends on your application.
Example of contiguous memory allocation: if you need a DMA operation to be performed, you will request contiguous memory through a kmalloc() call, because DMA requires memory that is physically contiguous: you provide only the starting address of the memory chunk, and the device reads or writes from that location onward.
Some operations do not require contiguous memory, so you can request a memory chunk through vmalloc(), which returns a pointer to non-contiguous physical memory.
So it depends entirely on the application that is requesting the memory.
Please remember that it is good practice to request contiguous memory only when you actually need it, since the kernel has to try hard to find a physically contiguous region. kmalloc() and vmalloc() have their limits, too.
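
For the DMA case specifically, modern drivers usually go through the DMA API rather than raw kmalloc(), so the buffer is both physically contiguous and visible to the device. A hedged sketch: 'dev' is assumed to come from a real driver's probe routine, and the 64 KiB size is arbitrary:

    #include <linux/dma-mapping.h>

    static void *buf;            /* CPU-side virtual address */
    static dma_addr_t bus_addr;  /* address the device is programmed with */

    static int setup_dma_buffer(struct device *dev)
    {
        /* Physically contiguous and cache-coherent. */
        buf = dma_alloc_coherent(dev, 64 * 1024, &bus_addr, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;
        return 0;
    }

    static void teardown_dma_buffer(struct device *dev)
    {
        dma_free_coherent(dev, 64 * 1024, buf, bus_addr);
    }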
Placing things we are going to read a lot physically close together takes advantage of spatial locality: the things we need are more likely to be cached.
Not sure about this one.
I believe this means that if pages are not contiguous, the page tables and TLB have to do more work to find out where they all are. If they are contiguous, every page for a process can be expressed as PAGES_START + PAGE_OFFSET; if they aren't, a separate mapping must be stored for every page of a given process. Because the TLB has a finite size and we need to access more mapping data, we will be swapping entries in and out a lot more.
The kernel does not strictly need physically contiguous pages; it wants them for efficiency and stability.
A monolithic kernel tends to have one set of page tables for kernel space, shared among all processes, and it does not want page faults in kernel space, because that would make the kernel design too complex. That is why the usual implementation on a 32-bit architecture is a 3G/1G split of the 4G address space: within the 1G kernel space, normal mappings of code and data should not generate recursive page faults, which would be too complex to manage. You would need to find empty page frames, create mappings in the MMU, and handle TLB flushes for the new mappings on every kernel-side page fault, and the kernel is already busy handling user-side page faults.
Furthermore, a 1:1 linear mapping can have far fewer page-table entries, because it can use a larger page size (bigger than 4 KB), and fewer entries lead to fewer TLB misses.
So the buddy allocator working on the kernel's linear address space always provides physically contiguous page frames, even though most code doesn't need contiguous frames. Moreover, many device drivers that do need contiguous page frames already assume that buffers allocated through the general kernel allocator are physically contiguous.

What's all this uncommitted, reserved memory in my process?

I'm using VMMap from SysInternals to look at memory allocated by my Win32 C++ process on WinXP, and I see a bunch of allocations where portions of the allocated memory are reserved but not committed. As far as I can tell, from my reading and testing, all of the common memory allocators (e.g., malloc, new, LocalAlloc, GlobalAlloc) used in a C++ program always allocate fully committed blocks of memory.
Heaps are a common example of code that reserves memory but doesn't commit it until needed. I suspect that some of these blocks are Windows/CRT heaps, but there appear to be more of these blocks than I would expect for heaps. I see on the order of 30 of these blocks in my process, between 64k and 8MB in size, and I know that my code never intentionally calls VirtualAlloc to allocate reserved, uncommitted memory.
Here are a couple of examples from VMMap: http://www.flickr.com/photos/95123032#N00/5280550393/
What else would allocate such blocks of memory, where much of it is reserved but not committed? Would it make sense that my process has 30 heaps? Thanks.
I figured it out - it's the CRT heap that gets allocated by calls to malloc. If you allocate a large chunk of memory (e.g., 2 MB) using malloc, it allocates a single committed block of memory. But if you allocate smaller chunks (say 177kb), then it will reserve a 1 MB chunk of memory, but only commit approximately what you asked for (e.g., 184kb for my 177kb request).
When you free that small chunk, that larger 1 MB chunk is not returned to the OS. Everything but 4k is uncommitted, but the full 1 MB is still reserved. If you then call malloc again, it will attempt to use that 1 MB chunk to satisfy your request. If it can't satisfy your request with the memory that it's already reserved, it will allocate a new chunk of memory that's twice the previous allocation (in my case it went from 1 MB to 2 MB). I'm not sure if this pattern of doubling continues or not.
To actually return your freed memory to the OS, you can call _heapmin. I would think that this would make a future large allocation more likely to succeed, but it would all depend on memory fragmentation, and perhaps _heapmin already gets called if an allocation fails (?), I'm not sure. There would also be a performance hit, since _heapmin would release the memory (taking time) and malloc would then need to re-allocate it from the OS when needed again. This information is for 32-bit Windows XP; your mileage may vary.
UPDATE: In my testing, heapmin did absolutely nothing. And the malloc heap is only used for blocks that are less than 512kb. Even if there are MBs of contiguous free space in the malloc heap, it will not use it for requests over 512kb. In my case, this freed, unused, yet reserved malloc memory chewed up huge parts of my process' 2GB address space, eventually leading to memory allocation failures. And since heapmin doesn't return the memory to the OS, I haven't found any solution to this problem, other than restarting my process or writing my own memory manager.
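
If you want to watch the reserve/commit split yourself, here is a sketch using VirtualQuery on a malloc'd block (sizes follow the example above; results will vary by CRT version):

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    int main(void)
    {
        MEMORY_BASIC_INFORMATION mbi;
        char *p = malloc(177 * 1024);    /* small enough for the CRT heap */

        if (p && VirtualQuery(p, &mbi, sizeof mbi)) {
            printf("region base: %p\n", mbi.AllocationBase);
            printf("this chunk : %zu bytes, state %s\n",
                   (size_t)mbi.RegionSize,
                   mbi.State == MEM_COMMIT ? "MEM_COMMIT" : "MEM_RESERVE");
        }
        free(p);
        return 0;
    }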
Whenever a thread is created in your application a certain (configurable) amount of memory will be reserved in the address space for the call stack of the thread. There's no need to commit all the reserved memory unless your thread is actually going to need all of that memory. So only a portion needs to be committed.
If more than the committed amount of memory is required, more of the reserved region can be committed on demand.
The practical consideration is that the reserved memory is a hard limit on the stack size that reduces address space available to the application. However, by only committing a portion of the reserve, we don't have to consume the same amount of memory from the system until needed.
Therefore it is possible for each thread to have a portion of reserved uncommitted memory. I'm unsure what the page type will be in those cases.
Could they be the DLLs loaded into your process? DLLs (and the executable) are memory mapped into the process address space. I believe this initially just reserves space. The space is backed by the files themselves (at least initially) rather than the pagefile.
Only the code that's actually touched gets paged in. If I understand the terminology correctly, that's when it's committed.
You could confirm this by running your application in a debugger and looking at the modules that are loaded and comparing their locations and sizes to what you see in VMMap.
