How does the Freepascal Memory Manager work on enlarging arrays - memory-management

I recently stumbled over an issue within a FreePascal project I'm developing: The application requires a look-up array which may become very large during the runtime (a few million entries). Each array element is about 8 bytes in size.
I observed the following behavior of my application: If the array is already quite large (~130 MB), another enlargement will result in a peak in memory consumption and maybe also in a volatile raise of used RAM.
As far as I read, the peak may be explained with the internal behavior of the SetLength()-method which allocates memory with the size of the new array and then copies the old array to its new destination in memory.
But during examination of the sudden increase of used memory it seemed that there were situations where the "old" memory was not freed, resulting in a doubled usage of the RAM.
I was able to reproduce this behavior more clearly as I raised the steps in which the array was enlarged.
To become rid of this problem, I changed the memory manager to CMem and the issue was gone.
Unfortunately I did not found a clear description of the Free Pascal memory manager and I can only guess, that the space of the "old" (small) array is not used because the built-in memory manager wants the heap to be not-fragmented all the time, but I could not prove that.
Does someone of you have a source which describes the basic functionality of the Free Pascal memory manager and the C memory manager and/or the differences between both?
Thank you very much, kind regards
Alex

Related

How to deal with external fragmentation, how paging helps with external fragmentation?

I know that there is a lot of questions regarding the issue I'm pointing here, but I couldn't find any complex answer (neither on StackOverflow nor in other sources).
I would like to ask about heap (RAM) fragmentation problem.
As I understood there are two kind of fragmentation:
internal - related with difference between allocation unit size (AU) and the size of the allocated memory AM (waste memory is equal to AM % AU),
external - related with noncontinuous areas of a free memory, so even if the sum of the free memory areas can handle the new allocation request, it fails if there is no continues area that can handle it.
This is quite clear. The problems start when the "paging" appears.
Sometimes I can find an information that paging solves the external fragmentation issue.
Indeed I agree that thanks to paging the OS is able to create the virtually continues areas of the memory, assigned to the process, even if physically the parts of the memory are scattered.
But how exactly does it help with the external fragmentation?
I mean, assuming that the size of a page has 4kB, and we want to allocate 16 kB, then of course we just need to find four empty pages frames, even if physically the frames are not a part of a continues area.
But what in case of the smaller allocation ?
I believe the page itself can still be fragmented and (in worst case) the OS still needs to provide a new frame if the old one cannot be used to allocate the requested memory.
So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
So the question is how to deal with the external fragmentation?
Own implementation of allocation algorithm ? Paging (as I wrote, not sure it helps) ? What else ? Does OS (Windows, Linux) provides some defragmentation methods ?
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
One more issue.. is internal fragmentation an ambiguous term ?
Somewhere I have spotted the definition that internal fragmentation points to the part of page frame, that is wasted because the process does not need more memory, but the single frame cannot be owned by more than a single processes.
I have bolded the questions, so the people who are in hurry could find the question without reading everything.
Regards!
"Fragmentation" is indeed not a very precise term. But we can say for sure that when a running application needs a block of n bytes and there are n or more bytes not in use, yet we can't get the required block, then "memory is too fragmented."
But how exactly does it [paging] help with the external allocation [I assume you mean fragmentation] ?
There's really nothing complicated here. External fragmentation is free memory between allocated blocks that's "too small" to satisfy any application requirement. This is a general concept. The definition of "too small" is application-dependent. Nonetheless, if allocated blocks can fall on any boundary, then it's easy, after many allocations and deallocations, for lots of such fragments to occur. Paging helps with external fragmentation in two ways.
First, it subdivides memory into fixed-size adjacent chunks - the pages - that are "large enough" so they're never useless. Again the definition of "large enough" is not precise. But most applications will have lots of requirements satisfiable by a single 4k page. Since no external fragmentation problem can occur for allocations of a page or less, the problem has been mitigated.
Second, the paging hardware provides a level of indirection between application pages and physical memory pages. Therefore any free physical memory page can be used to help satisfy any application request, no matter how large. For example, suppose you have 100 physical pages with every other physical page (50 of them) allocated. Without page-mapping hardware, the biggest request for contiguous memory that can be satisfied is 1 page. With mapping, it's 50 pages. (I'm disregarding virtual pages allocated initially with no mapped physical page. That's another discussion.)
But what in case of the smaller allocation ?
Again it's pretty simple. If the unit of allocation is a page, then any allocation smaller than a page yields an unused portion. This is internal fragmentation: unusable memory within an allocated block. The bigger you make allocation units (they don't have to be a single page), the more memory will be unusable due to internal fragmentation. On average, this will tend toward half of an allocation unit. Consequently, though OS's tend to allocate in units of pages, most application-side memory allocators request a very small number (often one) of big blocks (of pages) from the OS. They use much smaller allocation units internally: 4-16 bytes is pretty common.
So the question is how to deal with the external allocation [I assume you mean fragmentation] ? So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
If I understand you correctly, you're asking if fragmentation is inevitable. Except under very special conditions (e.g. the application only needs blocks of one size), the answer is yes. But that doesn't mean it's necessarily a problem.
Memory allocators use smart algorithms that limit fragmentation pretty effectively. For example, they may maintain "pools" with different block sizes, using the pool with block size most closely matching a given request. This tends to limit both internal and external fragmentation. A real world example that's very well documented is dlmalloc. The source code is also very clear.
Of course any general purpose allocator can fail under specific conditions. For this reason, modern languages (C++ and Ada are two I know) let you supply special-purpose allocators for objects of a given type. Typically -
for a fixed-size object - these might simply maintain a pre-allcoated free list, so fragmentation for that particular case is zero, and allocation/deallocation are very fast.
One more note: It's possible to totally eliminate fragmentation with copying/compacting garbage collection. Of course this requires underlying language support, and there's a performance bill to pay. A copying garbage collector compacts the heap by moving objects to eliminate unused space completely whenever it runs to reclaim storage. To do this it must update every pointer in the running program to the corresponding object's new location. While this may sound complex, I've implemented a copying garbage collector, and it's not so bad. The algorithms are extremely cool. Unfortunately, the semantics of many languages (e.g. C and C++) don't allow finding every pointer in the running program, which is required.
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
Though general purpose allocators are good, they're not guaranteed. It's not unusual for safety-critical or hard real time constrained systems to avoid heap use completely. On the other hand, when no absolute guarantee is needed, a general purpose allocator is often fine. There are many systems that run perfectly with tough loads for extended periods using general purpose allocators: fragmentation reaches an acceptable steady state and doesn't cause a problem.
One more issue.. is internal fragmentation an ambiguous term ?
The term isn't ambiguous, but is used in different contexts. The invariant is that it's referring to unused memory inside allocated blocks.
OS literature tends to assume the allocation unit is pages. For example, Linux sbrk lets you request the end of the data segment be set anywhere, but Linux allocates pages, not bytes, so the unused part of the last page is internal fragmentation from the OS's point of view.
Application-oriented discussions tend to assume allocation is in "blocks" or "chunks" of arbitrary size. dlmalloc uses about 128 discrete chunk sizes, each maintained in its own free list. Plus, it will custom allocate very large blocks using OS memory mapping system calls, so there's at most a page size (minus 1 byte) of mismatch between request and actual allocation. Clearly it's going to a lot of trouble to minimize internal fragmentation. The fragmentation caused a given allocation is the difference between the request and the chunk actually allocated. Since there are so many chunk sizes, that difference is strictly limited. On the other hand, the many chunk sizes increase chances of external fragmentation problems: free memory may consist entirely of chunks that are well-managed by dlmalloc, yet too small to honor an application requirement.

What is Go's memory footprint

This article on Wired about Dropbox's switch from Go to Rust for its MagicPocket product says
“memory footprint”—the amount of computer memory it demands while running Magic Pocket—was too high for the massive storage systems the company was trying to build.
Question(s): What exactly is Go's "memory footprint" (where does it come from, how is it measured etc, is it related to garbage collection,binary size, is it something that will always be high) and why is it higher than Rust's?
"It had a high memory footprint" is just another way to say their program used a lot of RAM. It is related to garbage collection in that GC'd programs only free memory periodically (because each GC cycle takes CPU time), whereas manual memory management tends to free memory more or less as soon as it's unused.
The downside of manual memory management is either that mistakes can cause crashes and security bugs (as in C++, where you can accidentally use a freed variable after the memory has been reused for something else) or you have to put effort into expressing the exact lifetimes of each variable, reference, etc. in your code so that the compiler can check that they're being used in a valid way (as in Rust, where you interact with the borrow checker to root out potentially incorrect uses of memory in your code).
The sentence in the Wired story makes it sound like "memory footprint" is a simple measurable quantity you could assign to any language (and your question takes that idea to its logical conclusion). It's not quite that simple. In different languages, doing different things has different costs in memory, performance, and so on, and you kind of have to understand languages'/runtimes' details to know how the language will work with a given sort of program.
For example, CPython has reference counting, and that frees unused memory sooner but at the cost of having to store and update reference counts. Java has, on the one hand, things like object headers that add a certain amount of memory overhead per object, but uses some tricks to speed garbage collection (like generational collection) that Go doesn't (yet). Or in Go, you might try to reduce the memory footprint of a program by recycling memory with free pools and adjusting GOGC to free unused memory more often as kostya said.
The bigger point there is not that those specific details I listed are super important, but that there can be a lot of details to consider other than "higher memory footprint" or "lower memory footprint."
So: "memory footprint" refers to the amount of RAM a particular program with a particular workload takes up. Bigger picture, it's one factor in a large set of tradeoffs that folks like you or I or Dropbox's team have to navigate.
The garbage collector requires free memory to be available to work efficiently. By default Go application needs roughly twice as much memory as size of live data set (memory occupied by application objects).
This can be tuned using GOGC environment variable. By setting it to a lower value the application will request less memory from OS but GC will run more frequently therefore will use more CPU resources. By setting it to a higher value the GC will run less frequently and use less resources but the application will have higher "memory footprint".
This is general idea but the exact memory, performance requirements and GOGC effect are highly application specific.

Benefits of reserving vs. committing+reserving memory using VirtualAlloc on large arrays

I am writing a C++ program that essentially works with very large arrays. On Windows, I am using VirtualAlloc to allocate memory to my arrays. Now I fully understand the difference between reserving and committing memory using VirutalAlloc; however, I am wondering whether there is any benefit in committing memory page-by-page to a reserved region. In particular, MSDN (http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx) contains the following explanation for the MEM_COMMIT option:
Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.
My experiments confirm this: I can reserve and commit several GB of memory wihtout increasing memory usage of my process (as shown in Task Manager); actual memory gets allocated only when I actually access memory.
Now I saw quite a few examples arguing that one should reserve a large portion of the address space and then commit memory page-by-page (or in some larger blocks, depending on the app's logic). As explained above, however, memory does not seem to be committed before one accesses it; thus, I'm wondering whether there is any real benefit in committing memory page-by-page. In fact, committing memory page-by-page might actually slow my program down due to many system calls for actually comitting memory. If I commit the entire region at once, I pay for just one system call, but the kernel seems to be smart enough to actually allocate only memory that I actually use.
I would appreciate it if someone could explain to me which strategy is better.
The difference is that commit "backs" the memory against the page file. To give an example:
Given 2GB of physical ram and 2GB of swap (assume fixed-size swap for this purpose).
Reserve 6GB - OK.
Commit first 2GB - OK.
Commit remaining 4GB - fails.
Extend swap file to 8GB
Commit remaining 4GB - succeeds.
The reason for using MEM_COMMIT would primarily be for runtime error suppression (app stability). If you have a process that commits pages on-demand then there's always a chance that a commit along-the-way could fail if it exceeds amount of memory+swap available. When memory has been backed by the page file then you have a strong guarantee that the memory is available for use from now until the point that you release it.
There's a number of reasons to go one way or the other, and I don't think there's any perfect science to deciding which. MEM_RESERVE alone is only needed for very large sparse array scenarios, ex: multi-gigabyte array which has at most 25-33% utilization (a popular technique for accelerating hash tables, etc).
Almost everything else is gray area where you could probably go either way -- MEM_COMMIT up-front would make your own app a little more stable and essentially give it priority to physical ram over competing apps that might allocate on-demand. (if you grab the ram first then your app will be the last left standing when physical memory is exhausted) At the same time, if you're not actually using all that ram then you may end up limiting the multi-tasking potential of your client's machine or causing unnecessary wasted disk space via a growing page file.

How to show percentage of 'memory used' in a win32 process?

I know that memory usage is a very complex issue on Windows.
I am trying to write a UI control for a large application that shows a 'percentage of memory used' number, in order to give the user an indication that it may be time to clear up some memory, or more likely restart the application.
One implementation used ullAvailVirtual from MEMORYSTATUSEX as a base, then used HeapWalk() to walk the process heap looking for additional free memory. The HeapWalk() step was needed because we noticed that after a while of running the memory allocated and freed by the heap was never returned and reported by the ullAvailVirtual number. After hours of intensive working, the ullAvailVirtual number no longer would accurately report the amount of memory available.
However, this method proved not ideal, due to occasional odd errors that HeapWalk() would return, even when the process heap was not corrupted. Further, since this is a UI control, the heap walking code was executing every 5-10 seconds. I tried contacting Microsoft about why HeapWalk() was failing, escalated a case via MSDN, but never got an answer other than "you probably shouldn't do that".
So, as a second implementation, I used PagefileUsage from PROCESS_MEMORY_COUNTERS as a base. Then I used VirtualQueryEx to walk the virtual address space adding up all regions that weren't MEM_FREE and returned a value for GetMappedFileNameA(). My thinking was that the PageFileUsage was essentially 'private bytes' so if I added to that value the total size of the DLLs my process was using, it would be a good approximation of the amount of memory my process was using.
This second method seems to (sorta) work, at least it doesn't cause crashes like the heap walker method. However, when both methods are enabled, the values are not the same. So one of the methods is wrong.
So, StackOverflow world...how would you implement this?
which method is more promising, or do you have a third, better method?
should I go back to the original method, and further debug the odd errors?
should I stay away from walking the heap every 5-10 seconds?
Keep in mind the whole point is to indicate to the user that it is getting 'dangerous', and they should either free up memory or restart the application. Perhaps a 'percentage used' isn't the best solution to this problem? What is? Another idea I had was a color based system (red, yellow, green, which I could base on more factors than just a single number)
Yes, the Windows memory manager was optimized to fulfill requests for memory as quickly and efficiently possible, it was not optimized to easily measure how much space is used. The first downfall is that heap blocks that are released are rarely unmapped. They are simply marked as "free", to be used by the next allocation. That's why VirtualQueryEx() cannot work.
The problem with HeapWalk is that you have to lock the heap (HeapLock) so that it can walk it without the heap allocation changing. That lock can have very detrimental side-effects. Quoting:
Walking a heap may degrade
performance, especially on symmetric
multiprocessing (SMP) computers. The
side effects may last until the
process ends.
Even then, the number you get back is pretty meaningless. A program never runs out of free space, it runs out of a large enough contiguous chunk of memory to fulfill the request. No happy answers I'm afraid. Except one: a 64-bit operating system cost less than two hundred bucks.
The place to start is probably GetProcessMemoryInfo(). This fills in a structure for you that has, among other things, the current working set in bytes.
Have a look at the following article .NET and running processes
It uses WMI to check for the memory usage of processes, specifically using the
System.Diagnostics.Process
and another link on how to use WMI in C#: WMI Made Easy for C#
Hope this helps.

Does calling free or delete ever release memory back to the "system"

Here's my question: Does calling free or delete ever release memory back to the "system". By system I mean, does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and when a request for memory allocation comes, it tries to allocate a memory block from this free list (I know, the allocator is much more complex than that but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using say sbrk or brk system calls. When a memory is free'd, that block is placed in the free list.
Now consider this scenario, on peak load, a lot of objects have been allocated on heap. Now when the load decreases, the objects are free'd. So my question is: Once the object is free'd will the allocator do some calculations to find whether it should just keep this object in the free list or depending upon the current size of the free list it may decide to give that memory back to the system i.e decrease the data segment of the process using sbrk or brk?
Documentation of glibc tells me that if the allocation request is much larger than page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what?
From what I know (correct me please), a memory allocated with malloc will never be released back to the system ever until the process ends i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is then, if I use a tool to see the memory usage of my process (I am using pmap on Linux, what do you guys use?), it should always show the memory used at peak load (as the memory is never given back to the system, except when allocated using mmap)? That is memory used by the process should never ever decrease(except the stack memory)? Is it?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.
There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.
No.
It's actually a bad strategy for a number of reasons, so it doesn't happen --except-- as you note, there can be an exception for large allocations that can be directly made in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.
It is entirely implementation dependent. On Windows VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.
I think that you have all the information you need to answer your own question. pmap shows the memory that is currenly being used by the process. So, if you call pmap before the process achieves peak memory, then no it will not show peak memory. if you call pmap just before the process exits, then it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc) but as the previous poster has stated, behavior of these is implmentation dependent.
This varies a bit from implementation to implementation.
Think of your memory as a massive long block, when you allocate to it you take a bit out of your memory (labeled '1' below):
111
If I allocate more more memory with malloc it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because two is in front of it (and memory is given as a continous block). However if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1'. I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system. e.g:
33_2222
I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for(i = 0; i < 10000; i++)
pointers[i] = malloc(1024);
for(i = 0; i < 9999; i++)
free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.
Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent
Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so and if the memory manager has handling for a small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistant ones so you can return the temporary allocations back to the OS.

Resources