Windows HeapFree, msvcrt free: do they cause the memory being freed to be paged in? I am trying to estimate whether not freeing memory at exit would speed up application shutdown significantly.
NOTE: This is a very specific technical question. It's not about whether applications should or should not call free at exit.
If you don't cleanly deallocate all your resources at application shutdown, it becomes nigh on impossible to detect whether you have any really serious problems, like memory leaks, which would be a bigger problem than a slow shutdown. If the UI disappears quickly, the user will think the application has shut down quickly, even if it still has a lot of work to do. With UI, the perception of speed is more important than actual speed. When the user selects 'Exit Application', the main application window should disappear immediately. It doesn't matter if the application takes a few seconds after that to free everything up and exit gracefully; the user won't notice.
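A minimal sketch of that ordering in a plain Win32 window procedure (CleanUpEverything is a hypothetical name for the slow teardown):

#include <windows.h>

// Hypothetical: the orderly teardown that may take a few seconds.
void CleanUpEverything() { /* free resources, close handles, ... */ }

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg) {
    case WM_CLOSE:
        ShowWindow(hwnd, SW_HIDE); // window vanishes immediately...
        CleanUpEverything();       // ...while the orderly teardown runs unseen
        DestroyWindow(hwnd);
        return 0;
    case WM_DESTROY:
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}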
I ran a test for HeapFree. The following program hits an access violation inside HeapFree at i = 31999:
#include <windows.h>

int main() {
    HANDLE heap = GetProcessHeap();
    void * bufs[64000];

    // populate heap
    for (unsigned i = 0; i < _countof(bufs); ++i) {
        bufs[i] = HeapAlloc(heap, 0, 4000);
    }

    // protect a block in the "middle"
    DWORD dwOldProtect;
    VirtualProtect(
        bufs[_countof(bufs) / 2], 4000, PAGE_NOACCESS,
        &dwOldProtect);

    // free blocks
    for (unsigned i = 0; i < _countof(bufs); ++i) {
        HeapFree(heap, 0, bufs[i]);
    }
}
The stack is:
ntdll.dll!_RtlpCoalesceFreeBlocks@16() + 0x12b9 bytes
ntdll.dll!_RtlFreeHeap@12() + 0x91f bytes
shutfree.exe!main() Line 19 C++
So it looks like the answer is "Yes" (this applies to free as well, since it uses HeapFree internally).
I'm almost certain the answer to the speed-improvement question is "yes". Freeing a block may or may not touch the actual block in question, but it will certainly have to update other bookkeeping information in any case. If you have zillions of small objects allocated (it happens), the effort required to free them all can have a significant impact.
If you can arrange it, you might try setting up your application such that when it knows it's going to quit, it saves any pending work (configuration, documents, whatever) and exits ungracefully.
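A minimal sketch of that pattern on Win32 (SaveAllPendingWork is a hypothetical placeholder for whatever persistence your application needs):

#include <windows.h>

// Hypothetical: flush documents, configuration, etc.
void SaveAllPendingWork() { /* ... */ }

void FastExit()
{
    SaveAllPendingWork();
    // Skip atexit handlers, static destructors and the per-block HeapFree
    // storm; the OS reclaims the whole address space at once.
    TerminateProcess(GetCurrentProcess(), 0);
}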
Related
I have a program running under OpenCL where, after I perform the calculations in private memory, I would like to write them to global memory. I have no use for the results further down the road; essentially, I am looking for a built-in solution to write to global memory from either __local or __private memory asynchronously.
I already tried async_work_group_copy, and I noticed that in order to ensure the data is correctly copied I have to wait for the event. For my card (an AMD HD7970) this is the same as doing a synchronous copy directly to global memory.
Does anyone have any experience with async_work_group_copy without waiting for the event, or any other viable alternative?
for (...) {
    // Calculate some results and copy to __local array src
    event_t e = async_work_group_copy(dest, src, size, 0);
    wait_group_events(1, &e); // Can we safely skip this??
}
Here src is __local and dest is __global.
I suspect that since this function has to be called identically by the whole work-group, skipping the wait on the event may not work, since other work-items may not have finished. The fact that this sits inside a for loop complicates things further.
I don't think there is much you have to (or can) do in this situation. I know that Intel's GPU implementation will not stall on a global write unless there's a register dependency hazard too soon after the write (e.g. if the program reuses that register too soon after the write, it will stall until the hazard clears). Sadly, you can't really control register allocation, or even see it.
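If the per-iteration wait is what hurts, one pattern that is sometimes worth trying is double buffering the __local staging area, so the copy issued from one buffer can still be in flight while the group computes into the other. A minimal sketch under stated assumptions: CHUNK, iters, size, producer and compute_into are hypothetical placeholders, and size elements are assumed to fit in one buffer.

#define CHUNK 256  // assumed chunk size (elements); "size" must not exceed it

// Hypothetical stand-in for the real per-work-item computation.
void compute_into(__local float *buf)
{
    buf[get_local_id(0)] = (float)get_local_id(0); // assumes local size == CHUNK
}

__kernel void producer(__global float *dest, uint iters, uint size)
{
    __local float src[2][CHUNK]; // two staging buffers
    event_t ev[2];

    for (uint i = 0; i < iters; ++i) {
        uint buf = i & 1;
        // Before overwriting a buffer, wait for the copy issued from it two
        // iterations ago. The condition is uniform across the work-group.
        if (i >= 2)
            wait_group_events(1, &ev[buf]);
        compute_into(src[buf]);
        barrier(CLK_LOCAL_MEM_FENCE); // src[buf] fully written by the group
        ev[buf] = async_work_group_copy(dest + i * size, src[buf], size, 0);
    }
    // Drain whatever is still in flight before the kernel exits.
    if (iters >= 2)
        wait_group_events(2, ev);
    else if (iters == 1)
        wait_group_events(1, &ev[0]);
}

Whether this buys anything depends on the hardware actually having an asynchronous copy engine; on devices where async_work_group_copy is compiled to a plain loop, it will behave exactly like the synchronous version.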
I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture, even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and that their execution is guaranteed to complete before an interrupt handler executes. This is verified by looking at
arch/arm/include/asm/atomic.h:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg, etc.), which use LDREX and STREX for ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and the STREX. The thing it does mention is locking the memory bus, which I guess only helps MP systems, where there can be more CPUs trying to access the same location at the same time. But on UP (and possibly MP), if a timer interrupt (or an IPI, for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context and returns to the new task. Now comes the shocking part: it executes CLREX, thereby removing any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive lock monitor, so I have a possible theory: when the thread resumes and executes the STREX, the monitor causes this call to fail; the failure can be detected and the loop re-executed using the new value in the process (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while (__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW].
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe or interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while (CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship exists among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one were using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; with the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from ARM's website:
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.
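Putting the pieces together, a retry loop in the style of the kernel's ARMv6+ atomic helpers might look like this hedged sketch (GCC inline assembly; my_atomic_add and the local atomic_t typedef are stand-ins, not the kernel's actual code):

typedef struct { int counter; } atomic_t;

// If STREX reports failure (tmp != 0) -- for example because a context
// switch executed CLREX between the LDREX and the STREX -- we branch back,
// reload the current value and retry.
static inline void my_atomic_add(int i, atomic_t *v)
{
    int result;
    unsigned long tmp;

    __asm__ __volatile__(
        "1: ldrex   %0, [%2]\n"
        "   add     %0, %0, %3\n"
        "   strex   %1, %0, [%2]\n"
        "   teq     %1, #0\n"
        "   bne     1b"
        : "=&r" (result), "=&r" (tmp)
        : "r" (&v->counter), "Ir" (i)
        : "cc", "memory");
}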
My program has a custom allocator which gets memory from the OS using mmap(MAP_ANON | MAP_PRIVATE). When it no longer needs memory, the allocator calls either munmap or madvise(MADV_FREE). MADV_FREE keeps the mapping around, but tells the OS that it can throw away the physical pages associated with the mapping.
Calling MADV_FREE on pages you're going to need again eventually is much faster than calling munmap and later calling mmap again.
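For illustration, the allocator's release path looks roughly like this (release_pages is a stand-in name; the #ifdef mirrors the fallback used in the test program below):

#include <stddef.h>
#include <sys/mman.h>

// Hypothetical helper: keep the mapping, let the OS reclaim the pages.
static void release_pages(void *p, size_t len)
{
#ifdef MADV_FREE
    madvise(p, len, MADV_FREE);      // Mac: lazy; pages go away under pressure
#else
    madvise(p, len, MADV_DONTNEED);  // Linux: RSS drops immediately
#endif
}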
This almost works perfectly for me. The only problem is that, on MacOS, MADV_FREE is very lazy about getting rid of the pages I've asked it to free. In fact, it only gets rid of them when there's memory pressure from another application. Until then, MacOS reports that my program is still using that memory; in Activity Monitor, its "Real Memory" column doesn't reflect the freed memory.
This makes it difficult for me to measure how much memory my program is actually using. (This difficulty in measuring RSS is keeping us from landing the custom allocator on 10.5.)
I could allocate a whole bunch of memory to force the OS to free up these pages, but in addition to taking a long time, that could have other side-effects, such as causing parts of my program to be paged out to disk.
On a lark, I tried the purge command, but that has no effect.
How can I force MacOS to clean out these MADV_FREE'd pages? Or, how can I ask MacOS how many MADV_FREE'd pages my process has in memory?
Here's a test program, if it helps. The Activity Monitor's "Real Memory" column shows 512MB after the program goes to sleep. On my Linux box, top shows 256MB of RSS, as desired.
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (512 * 1024 * 1024)

// We use MADV_FREE on Mac and MADV_DONTNEED on Linux.
#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED
#endif

int main()
{
    char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

    // Touch each page we mmap'ed so it gets a physical page.
    int i;
    for (i = 0; i < SIZE; i += 1024) {
        x[i] = i;
    }

    madvise(x, SIZE / 2, MADV_FREE);

    fprintf(stderr, "Sleeping. Now check my RSS. Hopefully it's %dMB.\n",
            SIZE / (2 * 1024 * 1024));
    sleep(1024);
    return 0;
}
One suggested workaround is to bounce the region's protection, which should force the kernel to drop the backing pages immediately:

mprotect(addr, length, PROT_NONE);
mprotect(addr, length, PROT_READ | PROT_WRITE);

Note that, as you say, madvise is lazier, and that is probably better for performance (just in case anyone is tempted to use this for performance rather than measurement).
Use MADV_FREE_REUSABLE on macOS. According to Apple's magazine_malloc implementation:
On OS X we use MADV_FREE_REUSABLE, which signals the kernel to remove the given pages from the memory statistics for our process. However, on returning that memory to use we have to signal that it has been reused.
https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
Chromium, for example, also uses it:
MADV_FREE_REUSABLE is similar to MADV_FREE, but also marks the pages with the reusable bit, which allows both Activity Monitor and memory-infra to correctly track the pages.
https://github.com/chromium/chromium/blob/master/base/memory/discardable_shared_memory.cc#L377
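A minimal sketch of the pairing, assuming the macOS-specific flags from <sys/mman.h> (MADV_FREE_REUSABLE and its counterpart MADV_FREE_REUSE); decommit and recommit are made-up names:

#include <stddef.h>
#include <sys/mman.h>

// macOS only: marking pages reusable removes them from the process's
// footprint statistics; they must be marked reused before being touched again.
void decommit(void *p, size_t len)
{
    madvise(p, len, MADV_FREE_REUSABLE); // mapping stays intact
}

void recommit(void *p, size_t len)
{
    madvise(p, len, MADV_FREE_REUSE);    // signal reuse before writing to p
}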
I've looked and looked, and I don't think this is possible. :\
We're solving the problem by adding code to the allocator which explicitly decommits MADV_FREE'd pages when we ask it to.
I'm using Win32 C++ in CodeGear Builder 2009, targeting Windows XP Embedded.
I found the PROCESS_MEMORY_COUNTERS_EX struct, and I have created a simple function to return the memory consumption of my process:
SIZE_T TForm1::ProcessPrivatBytes( DWORD processID )
{
    SIZE_T lRetval = 0;
    HANDLE hProcess;
    PROCESS_MEMORY_COUNTERS_EX pmc;

    hProcess = OpenProcess( PROCESS_QUERY_INFORMATION |
                            PROCESS_VM_READ,
                            FALSE, processID );
    if (NULL == hProcess)
    {
        lRetval = 1;
    }
    else
    {
        if ( GetProcessMemoryInfo( hProcess, (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc)) )
        {
            lRetval = pmc.WorkingSetSize;
            lRetval = pmc.PrivateUsage;
        }
        CloseHandle( hProcess );
    }
    return lRetval;
}
//---------------------------------------------------------------------------
Do I have to use lRetval = pmc.WorkingSetSize; or lRetval = pmc.PrivateUsage;?
PrivateUsage is what I see in perfmon, but what is WorkingSetSize exactly?
I want to see every byte I allocate in the counter when I allocate it. Is this possible?
This is a much tougher question than you probably realize. The reason is that Windows shares most executable code (especially the code that makes up most of Windows itself) between processes. For example, there's normally ONE copy of kernel32.dll loaded into memory, but it'll normally be mapped into every process. Do you consider that part of the memory your process is "using" or not?
Private memory is what's unique to that particular process. This can be somewhat misleading too. Since the executable for your process could potentially be shared with another process (i.e. two instances of your program could be run), that's not counted as part of the private memory, even if (as is often the case) there's only one instance of it running.
The working set size is about 99.999% meaningless. What it returns is whatever has been set as the preferred working set size for the process; you can adjust that with SetProcessWorkingSetSize(). Windows has a working set trimmer that attempts to trim down working sets. If memory serves, it uses the working set size to guess at whether it's worth trying to trim the working set of this process: if its current working set is larger than the working set size that was set, it tries to trim it down; otherwise, it (mostly) leaves it alone.
Chances are that nothing you do will show you every byte you allocate as you allocate it, though. Calling into Windows to allocate memory is fairly slow, so what's normally done is that the run-time library allocates a fairly big chunk of memory from Windows. When you allocate memory, the run-time library gives you a piece of that big chunk. Only when that chunk is gone does it go back to Windows and ask for more.
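That chunking is easy to observe. Here is a hedged sketch (MSVC-flavoured; links against psapi.lib) that compares PrivateUsage before and after a burst of small mallocs; expect the delta to be much coarser than the sum of the requests, because the CRT serves them from a chunk it already owns:

#include <windows.h>
#include <psapi.h>   // link with psapi.lib
#include <stdio.h>
#include <stdlib.h>

static SIZE_T private_usage(void)
{
    PROCESS_MEMORY_COUNTERS_EX pmc = {0};
    GetProcessMemoryInfo(GetCurrentProcess(),
                         (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));
    return pmc.PrivateUsage;
}

int main(void)
{
    SIZE_T before = private_usage();
    int i;
    for (i = 0; i < 1000; ++i)
        malloc(16);                 // deliberately leaked for the demo
    SIZE_T after = private_usage();
    // The delta grows in heap-chunk-sized steps, not 16 bytes at a time.
    printf("PrivateUsage delta: %Iu bytes\n", after - before);
    return 0;
}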
I'm writing the memory manager for an application, as part of a team of twenty-odd coders. We're running out of memory quota, and we need to be able to see what's going on, since we only appear to be using about 700MB. I need to be able to report where it's all going - fragmentation, etc. Any ideas?
You can use existing memory-debugging tools for this. I found Memory Validator quite useful; it is able to track both API-level (heap, new, ...) and OS-level (virtual memory) allocations, and to show virtual memory maps.
The other option, which I also found very useful, is to dump a map of the whole virtual address space using the VirtualQuery function. My code for this looks like this:
#include <windows.h>
#include <stdio.h>

void PrintVMMap()
{
    size_t start = 0;
    // TODO: make portable - not compatible with /3GB, 64b OS or 64b app
    size_t end = 1U << 31; // map 32b user space only - kernel space not accessible

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t pageSize = si.dwPageSize;

    size_t longestFreeApp = 0;
    int index = 0;
    for (size_t addr = start; addr < end; )
    {
        MEMORY_BASIC_INFORMATION buffer;
        SIZE_T retSize = VirtualQuery((void *)addr, &buffer, sizeof(buffer));
        if (retSize == sizeof(buffer) && buffer.RegionSize > 0)
        {
            // dump information about this region (%Iu is MSVC's size_t format)
            printf("%p: %Iu KB, state %08lx, protect %08lx\n",
                   buffer.BaseAddress, buffer.RegionSize / 1024,
                   buffer.State, buffer.Protect);
            // track longest free region - useful fragmentation indicator
            if (buffer.State & MEM_FREE)
            {
                if (buffer.RegionSize > longestFreeApp) longestFreeApp = buffer.RegionSize;
            }
            addr += buffer.RegionSize;
            index += buffer.RegionSize / pageSize;
        }
        else
        {
            // always proceed
            addr += pageSize;
            index++;
        }
    }
    printf("Longest free VM region: %Iu\n", longestFreeApp);
}
You can also find out information about the heaps in a process with Heap32ListFirst/Heap32ListNext, and about loaded modules with Module32First/Module32Next, from the Tool Help API.
'Tool Help' originated on Windows 9x. The original process-information API on Windows NT was PSAPI, which offers functions that partially (but not completely) overlap with Tool Help.
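For example, a minimal sketch of heap enumeration with Tool Help (error handling kept to a minimum):

#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

// Sketch: list the heaps of the current process via the Tool Help API.
int main()
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPHEAPLIST,
                                           GetCurrentProcessId());
    if (snap == INVALID_HANDLE_VALUE) return 1;

    HEAPLIST32 hl;
    hl.dwSize = sizeof(hl);
    if (Heap32ListFirst(snap, &hl)) {
        do {
            printf("heap id %p%s\n", (void *)hl.th32HeapID,
                   (hl.dwFlags & HF32_DEFAULT) ? " (default heap)" : "");
        } while (Heap32ListNext(snap, &hl));
    }
    CloseHandle(snap);
    return 0;
}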
Our (huge) application (a Win32 game) started throwing "Not enough quota" exceptions recently, and I was charged with finding out where all the memory was going. It is not a trivial job - this question and this one were my first attempts at finding out. Heap behaviour is unexpected, and accurately tracking how much quota you've used and how much is available has so far proved impossible. In fact, it's not particularly useful information anyway - "quota" and "somewhere to put things" are subtly and annoyingly different concepts. The accepted answer is as good as it gets, although enumerating heaps and modules is also handy. I used DebugDiag from MS to view the true horror of the situation, and to understand how hard it is to track everything thoroughly.