How can I force MacOS to release MADV_FREE'd pages? - macos

My program has a custom allocator which gets memory from the OS using mmap(MAP_ANON | MAP_PRIVATE). When it no longer needs memory, the allocator calls either munmap or madvise(MADV_FREE). MADV_FREE keeps the mapping around, but tells the OS that it can throw away the physical pages associated with the mapping.
Calling MADV_FREE on pages you're going to need again eventually is much faster than calling munmap and later calling mmap again.
This almost works perfectly for me. The only problem is that, on MacOS, MADV_FREE is very lazy about getting rid of the pages I've asked it to free. In fact, it only gets rid of them when there's memory pressure from another application. Until it gets rid of the pages I've freed, MacOS reports that my program is still using that memory; in the Activity Monitor, its "Real Memory" column doesn't reflect the freed memory.
This makes it difficult for me to measure how much memory my program is actually using. (This difficulty in measuring RSS is keeping us from landing the custom allocator on 10.5.)
I could allocate a whole bunch of memory to force the OS to free up these pages, but in addition to taking a long time, that could have other side-effects, such as causing parts of my program to be paged out to disk.
On a lark, I tried the purge command, but that has no effect.
How can I force MacOS to clean out these MADV_FREE'd pages? Or, how can I ask MacOS how many MADV_FREE'd pages my process has in memory?
Here's a test program, if it helps. The Activity Monitor's "Real Memory" column shows 512MB after the program goes to sleep. On my Linux box, top shows 256MB of RSS, as desired.
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#define SIZE (512 * 1024 * 1024)
// We use MADV_FREE on Mac and MADV_DONTNEED on Linux.
#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED
#endif
int main()
{
    char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
    // Touch each page we mmap'ed so it gets a physical page.
    int i;
    for (i = 0; i < SIZE; i += 1024) {
        x[i] = i;
    }
    madvise(x, SIZE / 2, MADV_FREE);
    fprintf(stderr, "Sleeping. Now check my RSS. Hopefully it's %dMB.\n", SIZE / (2 * 1024 * 1024));
    sleep(1024);
    return 0;
}

mprotect(addr, length, PROT_NONE);
mprotect(addr, length, PROT_READ | PROT_WRITE);
Note that, as you say, madvise is lazier, which is probably better for performance (in case anyone is tempted to use this mprotect trick for performance rather than measurement).

Use MADV_FREE_REUSABLE on macOS. According to Apple's magazine_malloc implementation:
On OS X we use MADV_FREE_REUSABLE, which signals the kernel to remove the given pages from the memory statistics for our process. However, on returning that memory to use we have to signal that it has been reused.
https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
Chromium, for example, also uses it:
MADV_FREE_REUSABLE is similar to MADV_FREE, but also marks the pages with the reusable bit, which allows both Activity Monitor and memory-infra to correctly track the pages.
https://github.com/chromium/chromium/blob/master/base/memory/discardable_shared_memory.cc#L377

I've looked and looked, and I don't think this is possible. :\
We're solving the problem by adding code to the allocator which explicitly decommits MADV_FREE'd pages when we ask it to.

Related

How do I disable ASLR for heap addresses for a program compiled and linked with mingw-w64 GCC? [duplicate]

For debugging purposes, I would like malloc to return the same addresses every time the program is executed, however in MSVC this is not the case.
For example:
#include <stdlib.h>
#include <stdio.h>
int main() {
    int test = 5;
    printf("Stack: %p\n", &test);
    printf("Heap: %p\n", malloc(4));
    return 0;
}
Compiling with cygwin's gcc, I get the same Stack address and Heap address every time, while compiling with MSVC with ASLR off...
cl t.c /link /DYNAMICBASE:NO /NXCOMPAT:NO
...I get the same Stack address every time, but the Heap address changes.
I have already tried adding the registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\MoveImages but it does not work.
Both the stack address and the pointer returned by malloc() may be different every time. As a matter of fact, both differ when the program is compiled and run on macOS multiple times.
The compiler and/or the OS may cause this behavior to make it more difficult to exploit software flaws. There might be a way to prevent this in some cases, but if your goal is to replay the same series of malloc() addresses, other factors may change the addresses, such as time-sensitive behaviors, file system side effects, and non-deterministic thread behavior. You should try to avoid relying on this for your tests.
Note also that &test should be cast as (void *) as %p expects a void pointer, which is not guaranteed to have the same representation as int *.
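The cast mentioned above can be sketched as follows. The helper name is hypothetical; it only shows an int pointer being converted to void * before it is passed to a %p conversion.

```c
#include <assert.h>
#include <stdio.h>

/* Format a pointer into buf with %p. Returns the number of characters
 * snprintf would write (excluding the terminating NUL), or a negative
 * value on error. */
static int format_address(char *buf, size_t size, int *ptr)
{
    assert(buf != NULL && size > 0);
    /* %p expects a void pointer, so convert explicitly. */
    return snprintf(buf, size, "%p", (void *)ptr);
}
```

The printed form of the address is implementation-defined, so only the conversion itself is portable, not the text it produces.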
It turns out that you may not be able to obtain deterministic behaviour from the MSVC runtime libraries. Both the debug and the production versions of the C/C++ runtime libraries end up calling a function named _malloc_base(), which in turn calls the Win32 API function HeapAlloc(). Unfortunately, neither HeapAlloc() nor the function that provides its heap, HeapCreate(), document a flag or other way to obtain deterministic behaviour.
You could roll your own allocation scheme on top of VirtualAlloc(), as suggested by @Enosh_Cohen, but then you'd lose the debug functionality offered by the MSVC allocation functions.
Diomidis' answer suggests making a new malloc on top of VirtualAlloc, so I did that. It turned out to be somewhat challenging because VirtualAlloc itself is not deterministic, so I'm documenting the procedure I used.
First, grab Doug Lea's malloc. (The ftp link to the source is broken; use this http alternative.)
Then, replace the win32mmap function with this (hereby placed into the public domain, just like Doug Lea's malloc itself):
static void* win32mmap(size_t size) {
    /* Where to ask for the next address from VirtualAlloc. */
    static char *next_address = (char*)(0x1000000);
    /* Return value from VirtualAlloc. */
    void *ptr = 0;
    /* Number of calls to VirtualAlloc we have made. */
    int tries = 0;
    while (!ptr && tries < 100) {
        ptr = VirtualAlloc(next_address, size,
                           MEM_RESERVE|MEM_COMMIT, PAGE_READWRITE);
        if (!ptr) {
            /* Perhaps the requested address is already in use. Try again
             * after moving the pointer. */
            next_address += 0x1000000;
            tries++;
        }
        else {
            /* Advance the request boundary. */
            next_address += size;
        }
    }
    /* Either we got a non-NULL result, or we exceeded the retry limit
     * and are going to return MFAIL. */
    return (ptr != 0)? ptr: MFAIL;
}
Now compile and link the resulting malloc.c with your program, thereby overriding the MSVCRT allocator.
With this, I now get consistent malloc addresses.
But beware:
The exact address I used, 0x1000000, was chosen by enumerating my address space using VirtualQuery to look for a large, consistently available hole. The address space layout appears to have some unavoidable non-determinism even with ASLR disabled. You may have to adjust the value.
I confirmed this works, in my particular circumstances, to get the same addresses during 100 sequential runs. That's good enough for the debugging I want to do, but the values might change after enough iterations, or after rebooting, etc.
This modification should not be used in production code, only for debugging. The retry limit is a hack, and I've done nothing to track when the heap shrinks.

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(p)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
The short answer to your question is no, or yes only with complex modifications to the kernel source.
First, to enable swapping out, ask yourself what happens when kswapd swaps memory out. Essentially, it walks through all the processes and decides whether their memory can be swapped out. All of that memory is user memory (ring 3), so SMAP essentially forbids the kernel (ring 0) from reading or writing it as data, and the related SMEP feature forbids executing it:
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
Check your distro's CONFIG_X86_SMAP setting; on my Ubuntu system it defaults to "y", as it has for the past few years.
If you instead keep your memory at a kernel address (ring 0), you would need to change kswapd so that it swaps out kernel addresses. Which kernel addresses should it walk first? And what if the address is part of kswapd's own kernel operation? The complexity involved is huge.
Next, consider the swap-in operation: when a read is attempted on memory whose page-table entry is marked "not present", a hardware exception triggers the Linux kernel page fault handler (__do_page_fault()).
Looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and then at how it handles kernel addresses (do_kern_addr_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
you can see that it essentially just reports an error for each possible scenario. If you want to enable page faulting on kernel addresses, this path has to be modified.
Note too that the SMAP check (inside smap_violation) is done in the user-address page fault path (do_user_addr_fault()).

Linux OOM killer does not work

I would like to test whether the kernel OOM killer works on my embedded Linux system. I used a test application that fills all memory, to see whether the OOM killer kills it when the system runs out of memory.
The test program I used:
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memset */
#define MEGABYTE (1024*1024)
int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;
    while(1)
    {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        /* Write to the block so the kernel has to back it with real pages. */
        memset(myblock,1, MEGABYTE);
        printf("Currently allocating %d MB\n",++count);
    }
    exit(0);
}
Results:
I always get:
MyApplication triggered out of memory codition (oom killer not called): gfp_mask=0x1200d2, order=0, oomkilladj=0
I tried changing /etc/sysctl by adding:
vm.oom_kill_allocating_task=1
vm.panic_on_oom=0
vm.overcommit_memory=0
How can I make the OOM killer work on my system?
Kernel version: 2.6.30 #7 SMP PREEMPT
The Linux “OOM killer” is a solution to the overcommit problem.
If you just “fill all memory”, then overcommit will not show up. The malloc call will eventually return a null pointer, the convention to indicate that the memory request cannot be fulfilled.
In order to cause an overcommit-related problem, you must allocate too much memory without writing to it, and then decide to write to all of it, so that the system finds itself forced to honor promises it made without having the capacity to fulfill them.
EDIT after source code was provided:
To be completely precise, in order to trigger a problem with overcommit and force the Linux OOM killer to take action, you should have several processes that in a first phase all reserve memory with malloc() (but do not write to it yet). Then have all of them write to the memory they have reserved at the same time. This will force Linux to honor the memory promises outside of any memory allocation, and it will have no choice but to kill a process that wasn't allocating (since none of them will be allocating at that moment).
Also, if you still want to see how or when the OOM killer works, I would suggest adding a fork() before the while loop. That gives you more than one process competing for memory, and eventually the OOM killer will kill one of them.
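The reserve-first, write-later sequence described in these answers can be sketched as three small helpers. The function names are hypothetical, and the block sizes and counts are left to the caller. To actually provoke the OOM killer you would run this in several processes with totals well beyond physical memory, which is not something to try on a machine you care about.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Phase 1: reserve up to n blocks without writing to them, so the kernel
 * can overcommit. Returns how many blocks were actually reserved. */
static size_t reserve_blocks(void **blocks, size_t n, size_t block_size)
{
    assert(blocks != NULL && block_size > 0);
    size_t got = 0;
    while (got < n) {
        blocks[got] = malloc(block_size);
        if (blocks[got] == NULL)
            break;
        got++;
    }
    return got;
}

/* Phase 2: write to every byte so the kernel must back each block with
 * real pages. This is the point at which overcommit can fail. */
static void touch_blocks(void **blocks, size_t n, size_t block_size)
{
    for (size_t i = 0; i < n; i++)
        memset(blocks[i], 1, block_size);
}

/* Give the memory back once the experiment is over. */
static void release_blocks(void **blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(blocks[i]);
}
```

The difference from the test program in the question is that no allocation happens during the write phase, so a failure cannot be reported through a NULL return from malloc.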

kzalloc() - Maximum size in a single call?

What is the maximum size that we can allocate using kzalloc() in a single call?
This is a very frequently asked question. Also, please let me know whether I can verify that value.
The upper limit (number of bytes that can be allocated in a single kmalloc / kzalloc request), is a function of:
the processor – really, the page size – and
the number of buddy system freelists (MAX_ORDER).
On both x86 and ARM, with a standard page size of 4 KB and MAX_ORDER of 11, the kmalloc upper limit on a single call is 4 MB.
Details, including explanations and code to test this, here:
http://kaiwantech.wordpress.com/2011/08/17/kmalloc-and-vmalloc-linux-kernel-memory-allocation-api-limits/
It is no different from kmalloc(). That's the question you should ask (or search for), because kzalloc is just a thin wrapper that sets __GFP_ZERO.
Up to about PAGE_SIZE (at least 4k) is no problem :p. Beyond that... you're right to say lots of people have asked; it's definitely something you have to think about. Apparently it depends on the kernel version: there used to be a hard 128k limit, but it's been increased (or maybe dropped altogether) now. That's just the hard limit though; what you can actually get depends on the given system (and very definitely on the kernel version).
Maybe read What is the difference between vmalloc and kmalloc?
You can always "verify" the allocation by checking the return value from kzalloc(), but by then you've probably already logged an allocation failure backtrace. Other than that, no - I don't think there's a good way to check in advance.
However, it depends on your kernel version and config. These limits are normally defined in linux/slab.h, usually as shown below (this example is from Linux 2.6.32):
#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
(MAX_ORDER + PAGE_SHIFT - 1) : 25)
#define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_HIGH)
#define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_HIGH - PAGE_SHIFT)
And you can test them with code below:
#include <linux/module.h>
#include <linux/slab.h>
int init_module()
{
    /* Print the compile-time kmalloc limits of the running kernel build. */
    printk(KERN_INFO "KMALLOC_SHIFT_LOW:%d, KMALLOC_SHIFT_HIGH:%d, KMALLOC_MIN_SIZE:%d, KMALLOC_MAX_SIZE:%lu\n",
           KMALLOC_SHIFT_LOW, KMALLOC_SHIFT_HIGH, KMALLOC_MIN_SIZE, KMALLOC_MAX_SIZE);
    return 0;
}
void cleanup_module()
{
    return;
}
Finally, the results under 32-bit Linux 2.6.32 are 3, 22, 8, 4194304, which means the minimum size is 8 bytes and the maximum size is 4 MB.
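The 4 MB figure follows directly from the macros quoted above. As a worked check, here is a small user-space mirror of that arithmetic. The function name is hypothetical, and the cap of 25 is taken from the 2.6.32 definition shown earlier, so other kernel versions may differ.

```c
#include <assert.h>

/* User-space mirror of the 2.6.32 KMALLOC_SHIFT_HIGH / KMALLOC_MAX_SIZE
 * macros: shift = min(MAX_ORDER + PAGE_SHIFT - 1, 25), size = 1 << shift. */
static unsigned long kmalloc_max_size(unsigned max_order, unsigned page_shift)
{
    unsigned shift = max_order + page_shift - 1;
    if (shift > 25)
        shift = 25;
    assert(shift < sizeof(unsigned long) * 8);
    return 1UL << shift;
}
```

With MAX_ORDER = 11 and PAGE_SHIFT = 12 (4 KB pages), the shift is 11 + 12 - 1 = 22, so kmalloc_max_size(11, 12) gives 1 << 22 = 4194304 bytes, matching the values printed by the module.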
PS.
You can also check the actual size of the memory allocated by kmalloc using ksize(), e.g.:
void *p = kmalloc(15, GFP_KERNEL);
printk(KERN_INFO "%zu\n", ksize(p)); /* ksize() returns size_t; this prints "16" under my kernel */

Do memory deallocation routines touch the block being freed?

Windows HeapFree, msvcrt free: do they cause the memory being freed to be paged-in? I am trying to estimate if not freeing memory at exit would speed up application shutdown significantly.
NOTE: This is a very specific technical question. It's not about whether applications should or should not call free at exit.
If you don't cleanly deallocate all your resources at application shutdown, it will be nigh on impossible to detect whether you have any really serious problems, like memory leaks, which would be more of a problem than a slow shutdown. If the UI disappears quickly, the user will think the application has shut down quickly even if it still has a lot of work to do. With UI, perception of speed is more important than actual speed. When the user selects the 'Exit Application' option, the main application window should immediately disappear. It doesn't matter if the application takes a few seconds after that to free up everything and exit gracefully; the user won't notice.
I ran a test for HeapFree. The following program hits an access violation inside HeapFree at i = 31999:
#include <windows.h>
int main() {
    HANDLE heap = GetProcessHeap();
    void * bufs[64000];
    // populate heap
    for (unsigned i = 0; i < _countof(bufs); ++i) {
        bufs[i] = HeapAlloc(heap, 0, 4000);
    }
    // protect a block in the "middle"
    DWORD dwOldProtect;
    VirtualProtect(
        bufs[_countof(bufs) / 2], 4000, PAGE_NOACCESS,
        &dwOldProtect);
    // free blocks
    for (unsigned i = 0; i < _countof(bufs); ++i) {
        HeapFree(heap, 0, bufs[i]);
    }
}
The stack is
ntdll.dll!_RtlpCoalesceFreeBlocks@16() + 0x12b9 bytes
ntdll.dll!_RtlFreeHeap@12() + 0x91f bytes
shutfree.exe!main() Line 19 C++
So it looks like the answer is "Yes" (this applies to free as well, since it uses HeapFree internally)
I'm almost certain the answer to the speed improvement question would be "yes". Freeing a block may or may not touch the actual block in question, but it will certainly have to update other bookkeeping information in any case. If you have zillions of small objects allocated (it happens), then the effort required to free them all could have a significant impact.
If you can arrange it, you might try setting up your application so that, once it knows it's going to quit, it saves any pending work (configuration, documents, whatever) and exits ungracefully.
