aligning memory in VirtualAlloc - winapi

As an optimization for some code on Win64, I am reserving 4GB of address space and then commit some number of MB (e.g. 512MB) within that address space (which ultimately gives huge perf improvements from array bound check removal, but that is all beside the point). The code I have basically looks like this:
LPVOID address = VirtualAlloc(null, FOUR_GB, MEM_RESERVE, PAGE_NOACCESS);
arrayAddress = VirtualAlloc(address, length, MEM_COMMIT, PAGE_READWRITE);
Someone on my team recently read a paper about large pages requiring less TLB lookups and having significant performance improvement in some scenarios, and this seemed like a prime candidate for trying that out.
However, what I'm reading makes me think that maybe this won't work. MSDN says that "the size and alignment must be a multiple of the large-page minimum". I can easily ensure that length is a multiple of large-page minimum, but how can I do this for alignment? If I could pass MEM_LARGE_PAGES flag to the reservation, I assume that would align it correctly. But I've read that you cannot call VirtualAlloc with MEM_RESERVE|MEM_LARGE_PAGES.
So my thought is that I can do what I'm doing now, but do VirtualAlloc with FOUR_GB+GetLargePageMinimum() during reservation and then align address to GetLargePageMinimum() when committing, but this doesn't feel right to me.
Does anyone know the correct way to go about it?

Related

Restricting virtual address range of a process?

On Linux x86_64, I have a simple application for which I track all memory accesses with Intel's PIN. The program uses only "a bit" of memory, most of it for dynamically allocated matrices (I've bisected the right value with ulimit). However, the memory accesses span the whole range of the VM address space, low addresses for what I presume global variables in the code, high addresses for the malloc()ed arrays.
There's a huge gap in the middle, and even in the high addresses the range is between 0x7fff4e1a37f4 and 0x7fea3af99000, which is much larger than what I would assume my application to use in total.
The post-processing that I need to do on the memory accesses deals very badly with these sparse accesses, so I'm looking for a way to restrict the virtual address range available to the process so that "it just fits", and accesses will show addresses between 0 and some more reasonable value for dynamically allocated memory (somewhere around the 40 Mb that I've discovered through ulimit).
Q: Is there an easy way to limit the available address space (and hence implicitly, available memory) to an individual process on Linux, ideally from the command line on a per-process basis?
Further notes:
I can link my application statically.
Even if I limit the memory with ulimit, the process still uses the full VM address range (not entirely unexpected).
I know about /proc/${pid}/maps, but would like to avoid creating wrappers to deal with this, and how to actually use the data in there.
I've heard about prelink (which may not apply to my static binary, but only libraries?) and can image that there are more intrusive ways to interfere with malloc(), but these solutions are too far out of my expertise to evaluate their usefulness (Limiting the heap area's Virtual address range, https://stackoverflow.com/a/29960208/60462)
If there's no simple command line-solution, instead of going for any elaborate hack, I'll probably just wing it in the post-processing and "normalize" the addresses e.g. via a few lines of perl).
ld.so(8) lists LD_PREFER_MAP_32BIT_EXEC as a way to restrict the address space to the lower 2GiB which is significantly less than the normal 64-bit address space, but possibly not small enough for your purposes.
It may also be possible to use the prctl(2) PR_SET_MM options to control the addresses of different parts of the program to solve your problem.

macOS equivalent of reserving memory without charging against commit limit

I often want large, contiguous regions of virtual address space that can grow on demand. On Windows, I do this my calling VirtualAlloc with MEM_RESERVE and a dwSize argument that I think is larger than the region is reasonably likely to ever need to grow, then committing a page at a time as needed. Not only does this defer mapping pages to physical memory until they're accessed, which committing the whole region to begin with would also do, it also defers charging the pages against the system commit limit. That way, the program doesn't limit how much memory other programs are allowed to commit for the sake of memory it isn't yet using and may never use. Basically, I want to handle my program's memory management such that it is in principle allowed to consume a large amount of memory if the user's demands require it, while at the same time allowing other programs to have that memory instead if they need it first.
Someone has already asked whether macOS has an equivalent to VirtualAlloc with MEM_RESERVE. The answers suggest that mmap with MAP_ANON | MAP_PRIVATE is roughly equivalent. What I'm wondering is, does macOS have an equivalent to the Windows commit limit, and does mmap, called with the right flags, behave like my VirtualAlloc usage in not charging against that limit?
Edit: someone else asked a similar question about Linux, for which mmap was also suggested. The answer suggests that on that platform, initially mapping the region with PROT_NONE and adding the desired privileges when the pages are needed is necessary to prevent the unused pages from counting against the commit limit. Lacking better documentation about why mapping memory in excess of the available physical memory is allowed in macOS (complete lack of commit limit? some kind of overcommit feature like Linux has?), I figure I might as well apply the same PROT_NONE trick just in case. At the very least it means I will be able to reuse the same code for Linux should I choose to support that platform as well.
I don't believe that macOS has a system commit limit. I don't find anything like that in the Mach APIs, which is the low-level VM API where I'd expect it to be. Likewise, I don't see anything like that in the output of sysctl -a, which reports many other VM details and statistics.
I can tell you for sure that I've reserved the entirety of the unused space in a 64-bit process (which has a 46-bit user address space). That is, just under 128TiB.
You can replicate that by just repeating mmap() calls with a large size until they fail, halve the size, and repeat, until your size is down to 1 page. Then, with the process paused at that point, apply vmmap -w -interleaved <pid> to it to see its allocations.

MapViewOfFileEx - valid lpBaseAddress

In answer to a question about mapping non-contiguous blocks of files into contiguous memory, here, it was suggested by one respondent that I should use VirtualAllocEx() with MEM_RESERVE in order to establish a 'safe' value for the final (lpBaseAddress) parameter for MapViewOfFileEx().
Further investigation revealed that this approach causes MapViewofFileEx() to fail with error 487: "Attempt to access invalid address." The MSDN page says:
"No other memory allocation can take place in the region that is used for mapping, including the use of the VirtualAlloc or VirtualAllocEx function to reserve memory."
While the documentation might be considered ambiguous with respect to valid sequences of calls, experimentation suggests that it is not valid to reserve memory for MapViewOfFileEx() using VirtualAllocEx().
On the web, I've found examples with hard-coded values - example:
#define BASE_MEM (VOID*)0x01000000
...
hMap = MapViewOfFileEx( hFile, FILE_MAP_WRITE, 0, 0, 0, BASE_MEM );
To me, this seems inadequate and unreliable... It is far from clear to me why this address is safe, or how many blocks can be safely be mapped there. It seems even more shaky given that I need my solution to work in the context of other allocations... and that I need my source to compile and work in both 32 and 64 bit contexts.
What I'd like to know is if there is any way to reliably reserve a pool of address space in order that - subsequently - it can be reliably used by MapViewOfFileEx to map blocks to explicit memory addresses.
You almost got to the solution by yourself but fell short of the last small step.
As you figured, use VirtualAlloc (with MEM_RESERVE) to find room in your address space, but after that (and before MapViewOfFileEx) use VirtualFree (with MEM_RELEASE). Now the address range will be free again. Then use the same memory address (returned by VirtualAlloc) with MapViewOfFileEx.
What you are trying to do is impossible.
From the MapViewOfFileEx docs, the pointer you supply is "A pointer to the memory address in the calling process address space where mapping begins. This must be a multiple of the system's memory allocation granularity, or the function fails."
The memory allocation granularity is 64K, so you cannot map disparate 4K pages from the file into adjacent 4K pages in virtual memory.
If you provide a base address, the function will try to map your file at that address. If it cannot use that base address (because something is already using all or part of the requested memory region), then the call will fail.
For most applications, there's no real point trying to fix the address yourself. If you're a sophisticated database process and you're trying to carefully manage your own memory layout on a machine with a known configuration for efficiency reasons, then it might be reasonable. But you'd have to be prepared for failure.
In 64-bit processes, the virtual address space is pretty wide open, so it might be possible to select a base address with some certainty, but I don't think I'd bother.
From MSDN:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address.
I believe "over time" refers to future versions of the OS and whatever run-time libraries you're using (e.g., for memory allocation), which might take a different approach to memory layout.
Also:
If the lpBaseAddress parameter specifies a base offset, the function succeeds if the specified memory region is not already in use by the calling process. The system does not ensure that the same memory region is available for the memory mapped file in other 32-bit processes.
So basically, your instinct is right: specifying a base address is not reliable. You can try, but you must be prepared for failure.
So to directly answer your question:
What I'd like to know is if there is any way to reliably reserve a pool of address space in order that - subsequently - it can be reliably used by MapViewOfFileEx to map blocks to explicit memory addresses.
No, there isn't. Not without applying many constraints on the runtime environment (e.g., limiting to a specific version of the OS, setting base addresses for all of your DLLs, disallowing DLL injection, etc.).

Getting the lowest free virtual memory address in windows

Title says it pretty much all : is there a way to get the lowest free virtual memory address under windows ? I should add that I am interested by this information at the beginning of the program (before any dynamic memory allocation has been done).
Why I need it : trying to build a malloc implementation under Windows. If it is not possible I would have to really to whatever VirtualAlloc() returns when given NULL as first parameter. While you would expect it to do something sensible, like allocation memory at the bottom of what is available, there are no guarantees.
This can be implemented yourself by using VirtualQuery looking for pages that are marked as free. It would be relatively slow though. (You will also need to consider allocation granularity which is different from page size.)
I will say that unless you need contiguous blocks of memory, trying to keep everything close together is mostly meaningless since if two pages of virtual memory might be next to each other in the address space, there is no reason to assume they are close to each other in physical memory. In fact, even if they are close to each other at some point in time, if those pages get moved to backing store and then faulted back into memory, the page would not be faulted to the same physical address page.
The OS uses more complicated metrics than just what is the "lowest" memory address available. Specifically, VirtualAlloc allocates pages of memory, so depending on how much you're asking for, at least one page of unused address space has to be available at the starting address. So even if you think there's a "lower" address that it should have used, that address might not have been compatible with the operation that you asked for.

Does calling free or delete ever release memory back to the "system"

Here's my question: Does calling free or delete ever release memory back to the "system". By system I mean, does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and when a request for memory allocation comes, it tries to allocate a memory block from this free list (I know, the allocator is much more complex than that but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using say sbrk or brk system calls. When a memory is free'd, that block is placed in the free list.
Now consider this scenario, on peak load, a lot of objects have been allocated on heap. Now when the load decreases, the objects are free'd. So my question is: Once the object is free'd will the allocator do some calculations to find whether it should just keep this object in the free list or depending upon the current size of the free list it may decide to give that memory back to the system i.e decrease the data segment of the process using sbrk or brk?
Documentation of glibc tells me that if the allocation request is much larger than page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what?
From what I know (correct me please), a memory allocated with malloc will never be released back to the system ever until the process ends i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is then, if I use a tool to see the memory usage of my process (I am using pmap on Linux, what do you guys use?), it should always show the memory used at peak load (as the memory is never given back to the system, except when allocated using mmap)? That is memory used by the process should never ever decrease(except the stack memory)? Is it?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.
There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.
No.
It's actually a bad strategy for a number of reasons, so it doesn't happen --except-- as you note, there can be an exception for large allocations that can be directly made in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.
It is entirely implementation dependent. On Windows VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.
I think that you have all the information you need to answer your own question. pmap shows the memory that is currenly being used by the process. So, if you call pmap before the process achieves peak memory, then no it will not show peak memory. if you call pmap just before the process exits, then it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc) but as the previous poster has stated, behavior of these is implmentation dependent.
This varies a bit from implementation to implementation.
Think of your memory as a massive long block, when you allocate to it you take a bit out of your memory (labeled '1' below):
111
If I allocate more more memory with malloc it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because two is in front of it (and memory is given as a continous block). However if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1'. I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system. e.g:
33_2222
I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for(i = 0; i < 10000; i++)
pointers[i] = malloc(1024);
for(i = 0; i < 9999; i++)
free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.
Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent
Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so and if the memory manager has handling for a small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistant ones so you can return the temporary allocations back to the OS.

Resources