Why memset function make the virtual memory so large - memory-management

I have a process will do much lithography calculation, so I used mmap to alloc some memory for memory pool. When process need a large chunk of memory, I used mmap to alloc a chunk, after use it then put it in the memory pool, if the same chunk memory is needed again in the process, get it from the pool directly, not used memory map again.(not alloc all the need memory and put it in the pool at the beginning of the process). Between mmaps function, there are some memory malloc not used mmap, such as malloc() or new().
Now the question is:
If I used memset() to set all the chunk data to ZERO before putting it in the memory pool, the process will use too much virtual memory as following, format is "mmap(size)=virtual address":
mmap(4198400)=0x2aaab4007000
mmap(4198400)=0x2aaab940c000
mmap(8392704)=0x2aaabd80f000
mmap(8392704)=0x2aaad6883000
mmap(67112960)=0x2aaad7084000
mmap(8392704)=0x2aaadb085000
mmap(2101248)=0x2aaadb886000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae028b000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae108f000
mmap(8392704)=0x2aaae1890000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaeca9c000
mmap(8392704)=0x2aaaec29b000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
mmap(8392704)=0x2aaafd6a7000
mmap(2101248)=0x2aacc5f8c000
The mmap last - first = 0x2aacc5f8c000 - 0x2aaab4007000 = 8.28G
But if I don't call memset before put in the memory pool:
mmap(4198400)=0x2aaab4007000
mmap(8392704)=0x2aaab940c000
mmap(8392704)=0x2aaad2480000
mmap(67112960)=0x2aaad2c81000
mmap(2101248)=0x2aaad6c82000
mmap(4198400)=0x2aaad6e83000
mmap(8392704)=0x2aaadb288000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae0a8c000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae1890000
mmap(8392704)=0x2aaae108f000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaec29b000
mmap(8392704)=0x2aaaec49c000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
The mmap last - first = 0x2aaaed49e000 - 0x2aaab4007000= 916M
So the first process will "out of memory" and killed.
In the process, the mmap memory chunk will not be fully used or not even used although it is alloced, I mean, for example, before calibration, the process mmap 67112960(64M), it will not used(write or read data in this memory region) or just used the first 2M bytes, then put in the memory pool.
I know the mmap just return virtual address, the physical memory used delay alloc, it will be alloced when read or write on these address.
But what made me confused is that, why the virtual address increase so much? I used the centos 5.3, kernel version is 2.6.18, I tried this process both on libhoard and the GLIBC(ptmalloc), both with the same behavior.
Do anyone meet the same issue before, what is the possible root cause?
Thanks.

VMAs (virtual memory areas, AKA memory mappings) do not need to be contiguous. Your first example uses ~256 Mb, the second ~246 Mb.
Common malloc() implementations use mmap() automatically for large allocations (usually larger than 64Kb), freeing the corresponding chunks with munmap(). So you do not need to mmap() manually for large allocations, your malloc() library will take care of that.
When mmap()ing, the kernel returns a COW copy of a special zero page, so it doesn't allocate memory until it's written to. Your zeroing is causing memory to be really allocated, better just return it to the allocator, and request a new memory chunk when you need it.
Conclusion: don't write your own memory management unless the system one has proven inadecuate for your needs, and then use your own memory management only when you have proved it noticeably better for your needs with real life load.

Related

What happen if kmem_cache has no free memory for allocating?

I create a slab cache by kmem_cache_create(... size), then allocate memory from this cache by kmem_cache_alloc().
After I have allocated memory for "size" times, what happen if I call kmem_cache_alloc() to allocat size + 1th memory? Return NULL or extend cache implicitly?
The 'size' argument is not about memory reserved for anything. It is about the size of each allocation as returned by kmem_cache_alloc.
It may be there will be memory shortage in general, in which case, depending on flags pased to kmem_cache_alloc, the kernel may try to free some by e.g. shrinking caches.

Where is the heap?

I understand that in Linux the mm_struct describes the memory layout of a process. I also understand that the start_brk and brk mark the start and end of the heap section of a process respectively.
Now, this is my problem: I have a process, for which I wrote the source code, that allocates 5.25 GB of heap memory using malloc. However, when I examine the process's mm_sruct using a kernel module I find the value of is equal to 135168. And this is different from what I expected: I expected brk - start_brk to be equal slight above 5.25 GB.
So, what is going on here?
Thanks.
I notice the following in the manpage for malloc(3):
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
So it sounds like mmap is used instead of the heap.

Is VirtualAlloc alignment consistent with size of allocation?

When using the VirtualAlloc API to allocate and commit a region of virtual memory with a power of two size of the page boundary such as:
void* address = VirtualAlloc(0, 0x10000, MEM_COMMIT, PAGE_READWRITE); // Get 64KB
The address seems to always be in 64KB alignment, not just the page boundary, which in my case is 4KB.
The question is: Is this alignment reliable and prescribed, or is it just coincidence? The docs state that it is guaranteed to be on a page boundary, but does not address the behavior I'm seeing. I ask because I'd later like to take an arbitrary pointer (provided by a pool allocator that uses this chunk) and determine which 64KB chunk it belongs to by something similar to:
void* chunk = (void*)((uintptr_t)ptr & 0xFFFF0000);
The documentation for VirtualAlloc describes the behavior for 2 scenarios: 1) Reserving memory and 2) Committing memory:
If the memory is being reserved, the specified address is rounded down to the nearest multiple of the allocation granularity.
If the memory is already reserved and is being committed, the address is rounded down to the next page boundary.
In other words, memory is allocated (reserved) in multiples of the allocation granularity and committed in multiples of a page size. If you are reserving and committing memory in a single step, it will be be aligned at a multiple of the allocation granularity. When committing already reserved memory it will be aligned at a page boundary.
To query a system's page size and allocation granularity, call GetSystemInfo. The SYSTEM_INFO structure's dwPageSize and dwAllocationGranularity will hold the page size and allocation granularity, respectively.
This is entirely normal. 64KB is the value of SYSTEM_INFO.dwAllocationGranularity. It is a simple counter-measure against address space fragmentation, 4KB pages are too small. The memory manager will still sub-divide 64KB chunks as needed if you change page protection of individual pages within the chunk.
Use HeapAlloc() to sub-allocate. The heap manager has specific counter-measures against fragmentation.

Why can't we to mark as WriteCombined already existing memory region by using `cudaHostRegister()`?

In CUDA SDK function cudaHostAlloc() for allocation new memory region can use flags:
cudaHostAllocDefault (default - 0 and causes cudaHostAlloc() to emulate cudaMallocHost())
cudaHostAllocPortable
cudaHostAllocMapped
cudaHostAllocWriteCombined
To mark memory region that already allocated we can use cudaHostRegister() with flags:
0 (default)
cudaHostRegisterPortable
cudaHostRegisterMapped
Why we can mark memory WriteCombined when allocating it by flag cudaHostAllocWriteCombined by using cudaHostAlloc(), but can't mark as WriteCombined already existing memory region by using cudaHostRegister()?
Already allocated memory we must will mark only through the POSIX function set_memory_wc()?
I did not know of any APIs that could change the cacheability of an existing VA range until you referenced set_memory_wc(). Such an operation would be extremely expensive due to all the cache flushes and TLB shootdowns that would be required; and the memory would basically be unreadable until you found some way to unmark it as WC.
Why are you trying to use WC memory? On pre-i7 (Nehalem) CPUs, WC had slightly higher transfer performance (IIRC) because it inhibited snooping of PCI Express traffic to and from the memory. But on Nehalem and later CPUs, I don't know of any application that has concretely demonstrated a benefit from WC memory.

Allocating a large DMA buffer

I want to allocate a large DMA buffer, about 40 MB in size. When I use dma_alloc_coherent(), it fails and what I see is:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2106 __alloc_pages_nodemask+0x1dc/0x788()
Modules linked in:
[<8004799c>] (unwind_backtrace+0x0/0xf8) from [<80078ae4>] (warn_slowpath_common+0x4c/0x64)
[<80078ae4>] (warn_slowpath_common+0x4c/0x64) from [<80078b18>] (warn_slowpath_null+0x1c/0x24)
[<80078b18>] (warn_slowpath_null+0x1c/0x24) from [<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788)
[<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788) from [<8004a880>] (__dma_alloc+0xa4/0x2fc)
[<8004a880>] (__dma_alloc+0xa4/0x2fc) from [<8004b0b4>] (dma_alloc_coherent+0x54/0x60)
[<8004b0b4>] (dma_alloc_coherent+0x54/0x60) from [<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec)
[<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec) from [<80123b78>] (do_vfs_ioctl+0x80/0x54c)
[<80123b78>] (do_vfs_ioctl+0x80/0x54c) from [<8012407c>] (sys_ioctl+0x38/0x5c)
[<8012407c>] (sys_ioctl+0x38/0x5c) from [<80041f80>] (ret_fast_syscall+0x0/0x30)
---[ end trace 4e0c10ffc7ffc0d8 ]---
I've tried different values and it looks like dma_alloc_coherent() can't allocate more than 2^25 bytes (32 MB).
How can such a large DMA buffer can be allocated?
After the system has booted up dma_alloc_coherent() is not necessarily reliable for large allocations. This is simply because non-moveable pages quickly fill up your physical memory making large contiguous ranges rare. This has been a problem for a long time.
Conveniently a recent patch-set may help you out, this is the contiguous memory allocator which appeared in kernel 3.5. If you're using a kernel with this then you should be able to pass cma=64M on your kernel command line and that much memory will be reserved (only moveable pages will be placed there). When you subsequently ask for your 40M allocation it should reliably succeed. Simples!
For more information check out this LWN article:
https://lwn.net/Articles/486301/

Resources