How to deal with external fragmentation, how paging helps with external fragmentation?

How to deal with external fragmentation, how paging helps with external fragmentation? - memory-management

I know that there is a lot of questions regarding the issue I'm pointing here, but I couldn't find any complex answer (neither on StackOverflow nor in other sources).
I would like to ask about heap (RAM) fragmentation problem.
As I understood there are two kind of fragmentation:
internal - related with difference between allocation unit size (AU) and the size of the allocated memory AM (waste memory is equal to AM % AU),
external - related with noncontinuous areas of a free memory, so even if the sum of the free memory areas can handle the new allocation request, it fails if there is no continues area that can handle it.
This is quite clear. The problems start when the "paging" appears.
Sometimes I can find an information that paging solves the external fragmentation issue.
Indeed I agree that thanks to paging the OS is able to create the virtually continues areas of the memory, assigned to the process, even if physically the parts of the memory are scattered.
But how exactly does it help with the external fragmentation?
I mean, assuming that the size of a page has 4kB, and we want to allocate 16 kB, then of course we just need to find four empty pages frames, even if physically the frames are not a part of a continues area.
But what in case of the smaller allocation ?
I believe the page itself can still be fragmented and (in worst case) the OS still needs to provide a new frame if the old one cannot be used to allocate the requested memory.
So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
So the question is how to deal with the external fragmentation?
Own implementation of allocation algorithm ? Paging (as I wrote, not sure it helps) ? What else ? Does OS (Windows, Linux) provides some defragmentation methods ?
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
One more issue.. is internal fragmentation an ambiguous term ?
Somewhere I have spotted the definition that internal fragmentation points to the part of page frame, that is wasted because the process does not need more memory, but the single frame cannot be owned by more than a single processes.
I have bolded the questions, so the people who are in hurry could find the question without reading everything.
Regards!

"Fragmentation" is indeed not a very precise term. But we can say for sure that when a running application needs a block of n bytes and there are n or more bytes not in use, yet we can't get the required block, then "memory is too fragmented."
But how exactly does it [paging] help with the external allocation [I assume you mean fragmentation] ?
There's really nothing complicated here. External fragmentation is free memory between allocated blocks that's "too small" to satisfy any application requirement. This is a general concept. The definition of "too small" is application-dependent. Nonetheless, if allocated blocks can fall on any boundary, then it's easy, after many allocations and deallocations, for lots of such fragments to occur. Paging helps with external fragmentation in two ways.
First, it subdivides memory into fixed-size adjacent chunks - the pages - that are "large enough" so they're never useless. Again the definition of "large enough" is not precise. But most applications will have lots of requirements satisfiable by a single 4k page. Since no external fragmentation problem can occur for allocations of a page or less, the problem has been mitigated.
Second, the paging hardware provides a level of indirection between application pages and physical memory pages. Therefore any free physical memory page can be used to help satisfy any application request, no matter how large. For example, suppose you have 100 physical pages with every other physical page (50 of them) allocated. Without page-mapping hardware, the biggest request for contiguous memory that can be satisfied is 1 page. With mapping, it's 50 pages. (I'm disregarding virtual pages allocated initially with no mapped physical page. That's another discussion.)
But what in case of the smaller allocation ?
Again it's pretty simple. If the unit of allocation is a page, then any allocation smaller than a page yields an unused portion. This is internal fragmentation: unusable memory within an allocated block. The bigger you make allocation units (they don't have to be a single page), the more memory will be unusable due to internal fragmentation. On average, this will tend toward half of an allocation unit. Consequently, though OS's tend to allocate in units of pages, most application-side memory allocators request a very small number (often one) of big blocks (of pages) from the OS. They use much smaller allocation units internally: 4-16 bytes is pretty common.
So the question is how to deal with the external allocation [I assume you mean fragmentation] ? So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
If I understand you correctly, you're asking if fragmentation is inevitable. Except under very special conditions (e.g. the application only needs blocks of one size), the answer is yes. But that doesn't mean it's necessarily a problem.
Memory allocators use smart algorithms that limit fragmentation pretty effectively. For example, they may maintain "pools" with different block sizes, using the pool with block size most closely matching a given request. This tends to limit both internal and external fragmentation. A real world example that's very well documented is dlmalloc. The source code is also very clear.
Of course any general purpose allocator can fail under specific conditions. For this reason, modern languages (C++ and Ada are two I know) let you supply special-purpose allocators for objects of a given type. Typically -
for a fixed-size object - these might simply maintain a pre-allcoated free list, so fragmentation for that particular case is zero, and allocation/deallocation are very fast.
One more note: It's possible to totally eliminate fragmentation with copying/compacting garbage collection. Of course this requires underlying language support, and there's a performance bill to pay. A copying garbage collector compacts the heap by moving objects to eliminate unused space completely whenever it runs to reclaim storage. To do this it must update every pointer in the running program to the corresponding object's new location. While this may sound complex, I've implemented a copying garbage collector, and it's not so bad. The algorithms are extremely cool. Unfortunately, the semantics of many languages (e.g. C and C++) don't allow finding every pointer in the running program, which is required.
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
Though general purpose allocators are good, they're not guaranteed. It's not unusual for safety-critical or hard real time constrained systems to avoid heap use completely. On the other hand, when no absolute guarantee is needed, a general purpose allocator is often fine. There are many systems that run perfectly with tough loads for extended periods using general purpose allocators: fragmentation reaches an acceptable steady state and doesn't cause a problem.
One more issue.. is internal fragmentation an ambiguous term ?
The term isn't ambiguous, but is used in different contexts. The invariant is that it's referring to unused memory inside allocated blocks.
OS literature tends to assume the allocation unit is pages. For example, Linux sbrk lets you request the end of the data segment be set anywhere, but Linux allocates pages, not bytes, so the unused part of the last page is internal fragmentation from the OS's point of view.
Application-oriented discussions tend to assume allocation is in "blocks" or "chunks" of arbitrary size. dlmalloc uses about 128 discrete chunk sizes, each maintained in its own free list. Plus, it will custom allocate very large blocks using OS memory mapping system calls, so there's at most a page size (minus 1 byte) of mismatch between request and actual allocation. Clearly it's going to a lot of trouble to minimize internal fragmentation. The fragmentation caused a given allocation is the difference between the request and the chunk actually allocated. Since there are so many chunk sizes, that difference is strictly limited. On the other hand, the many chunk sizes increase chances of external fragmentation problems: free memory may consist entirely of chunks that are well-managed by dlmalloc, yet too small to honor an application requirement.

Related

How many block from main memory could be mapped to the same cache block?

What is the best amount of main memory address to map to the same cache block?
I want to give an example to clarify my question.
Assume we want to map 512 KB main memory to 4 KB cache. If we choose to create a direct mapped cache: we are going to map 128 main memory addresses to the same cache block? Is it too much? How to know?

Your question doesn't have enough information to say how many blocks of main memory would map to the same cache line.  What's missing is the block size: number of bytes per block.
As to whether any given cache is sufficient, that is a question of the job, also know as the workload — different workloads/jobs will exhibit different behaviors.  The same job/program with different sets of input can exhibit different behaviors.  Actual experimentation is pretty much the only way to know, but no matter what it will be a judgement call as to whether any given settings are "good enough".
When you have thrashing of cache lines, as is possible with direct mapped cache, that is detrimental to performance, so designers who identify this will try to fix that.
Software designers may chunk the data differently using algorithms to mitigate this given existing hardware they have to work with.
Hardware designers can implement different caching strategies, ranging from larger caches, to 2-, 4-, 8-way set associativity, to fully associative — for the hardware each of these incurs more hardware costs than a direct mapped cache (and may incur performance penalty as well), and further might price the component out of the market (make it too expensive) or exceed the process technology (i.e. be to big for one chip).

Why shouldn't we have dynamic allocated memory with different size in embedded system

I have heard in embedded system, we should use some preallocated fixed-size memory chunks(like buddy memory system?). Could somebody give me a detailed explanation why?
Thanks,

In embedded systems you have very limited memory. Therefore, if you occasionally lose only one byte of memory (because you allocate it , but you dont free it), this will eat up the system memory pretty quickly (1 GByte of RAM, with a leak rate of 1/hour will take its time. If you have 4kB RAM, not as long)
Essentially the behaviour of avoiding dynamic memory is to avoid the effects of bugs in your program. As static memory allocation is fully deterministic (while dynamic memory alloc is not), by using only static memory allocation one can counteract such bugs. One important factor for that is that embedded systems are often used in security-critical application. A few hours of downtime could cost millions or an accident could happen.
Furthermore, depending on the dynamic memory allocator, the indeterminism also might take an indeterminate amount of time, which can lead to more bugs especially in systems relying on tight timing (thanks to Clifford for mentioning this). This type of bug is often hard to test and to reproduce because it relies on a very specific execution path.
Additionally, embedded systems don't usually have MMUs, so there is nothing like memory protection. If you run out of memory and your code to handle that condition doesn't work, you could end up executing any memory as instruction (bad things could happen! However this case is only indirectly related to dynamic mem allocation).
As Hao Shen mentioned, fragmentation is also a danger. Whether it may occur depends on your exact usecase, but in embedded systems it is quite easy to loose 50% of your RAM due to fragmentation. You can only avoid fragmentation if you allocate chunks that always have the exact same size.
Performance also plays a role (depends on the usecase - thanks Hao Shen). Statically allocated memory is allocated by the compiler whereas malloc() and similar need to run on the device and therefore consume CPU time (and power).
Many embedded OSs (e.g. ChibiOS) support some kind of dynamic memory allocator. But using it only increases the possibility of unexpected issues to occur.
Note that these arguments are often circumvented by using smaller statically allocated memory pools. This is not a real solution, as one can still run out of memory in those pools, but it will only affect a small part of the system.
As pointed out by Stephano Sanfilippo, some system don't even have enough resources to support dynamic memory allocation.
Note: Most coding standard, including the JPL coding standard and DO-178B (for critical avionics code - thanks Stephano Sanfilippo) forbid the use of malloc.
I also assume the MISRA C standard forbids malloc() because of this forum post -- however I don't have access to the standard itself.

The main reasons not to use dynamic heap memory allocation here are basically:
a) Determinism and, correlated,
b) Memory fragmentation.
Memory leaks are usually not a problem in those small embedded applications, because they will be detected very early in development/testing.
Memory fragmentation can however become non-deterministic, causing (best case) out-of-memory errors at random times and points in the application in the field.
It may also be non-trivial to predict the actual maximum memory usage of the application during development with dynamic allocation, whereas the amount of statically allocated memory is known at compile time and it is absolutely trivial to check if that memory can be provided by the hardware or not.

Allocating memory from a pool of fixed size chunks has a couple advantages over dynamic memory allocation. It prevents heap fragmentation and it is more deterministic.
With dynamic memory allocation, dynamically sized memory chunks are allocated from a fixed size heap. The allocations aren't necessarily freed in the same order that they're allocated. Over time this can lead to a situation where the free portions of the heap are divided up between allocated portions of the heap. As this fragmentation occurs, it can become more difficult to fulfill requests for larger allocations of memory. If a request for a large memory allocation is made, and there is no contiguous free section in the heap that's large enough then the allocation will fail. The heap may have enough total free memory but if it's all fragmented and there is not a contiguous section then the allocation will fail. The possibility of malloc() failing due to heap fragmentation is undesirable in embedded systems.
One way to combat fragmentation is rejoin the smaller memory allocations into larger contiguous sections as they are freed. This can be done in various ways but they all take time and can make the system less deterministic. For example, if the memory manager scans the heap when a memory allocation is freed then the amount of time it takes free() to complete can vary depending on what types of memory are adjacent to the allocation being freed. That is non-deterministic and undesirable in many embedded systems.
Allocating from a pool of fixed sized chunks does not cause fragmentation. So long as there is some free chunks then an allocation won't fail because every chunk is the right size. Plus allocating and freeing from a pool of fixed size chunks is simpler. So the allocate and free functions can be written to be deterministic.

If we have infinite memory, then do we still be needing paging?

Paging creates illusion that each process has infinite RAM by moving pages to and from disk. So if we have infinite memory(in some hypothetical situation), do we still need Paging? If yes, then why? I faced this question in an interview.

Assuming that "infinite memory" means infinite randomly accessable memory, or RAM, we will still need paging. Although paging is often associated with the ability to swap pages in and out of RAM to a hard disk to conserve memory, this is merely one aspect of paging. Here are some other reasons to have paging:
Security. Paging is a method to enforce operating system security and memory protection by ensuring that a processes cannot access the memory of another process and that it cannot modify the resident kernel.
Multitasking. Paging aids in multitasking by virtualizing the memory space, that is, address 0xFOO in Process A can be something completely different than 0xFOO in Process B
Memory Allocation. Paging aids in memory allocation by reducing fragmentation and ensuring RAM is only allocated when accessed. What this means is that although a process needs, suppose, 100MB of continuous RAM space, this need not be continuous physically. Additionally, when a program requests 100MB of space, the operating system will tell the program it's safe to use that 100MB of space, yet it will not be actually allocated until the program uses that space to its fullest.
Admittedly, the latter would not be entirely necessary if one had infinite RAM, nonetheless; it is always good practice to be eifficient even when we are not resource constrained. It also demonstrates a use for paging that is sometimes not considered.

This is a philosophical question, so here's a philosophical answer :)
The trick in this question is that you make assumptions about the infinite memory. It's ok to say "no, no need to use paging, BUT". And follow by:
The infinite memory has to be accessible within the acceptable time limit for memory access. If it's not (because infinity takes a lot of space, and the memory sits farther away from the processing unit), there's no difference between it and the disk, both are not satisfying the readily available memory requirement, which is what caching via pages tries to solve.
Take for example Amazon's S3, which for all practical purposes is infinite. If you can rely on S3 to satisfy all your memory requirements in the sense that when you need something fetched within time x you can fetch it from S3, there's no need to page anything or even hold it in "local" memory. Just get it from S3 whenever you need it, as many times as you need it. (Obviously this would have other repercussions like cost and network, but let's ignore that for now).
Of course you can always say that optimally you want memory access to be as fast as possible, and "fast enough" is probably slower than "fastest", so local memory access would give you better performance etc.
And last, if I had to envision a memory which is infinite and has the same access time no matter how "far" the memory unit is from the fetching unit, I would have to envision a sphere where the processing unit is in the middle, so that you can't argue that one memory unit is slower than the other because of the distance. Otherwise you could say that paging would be done internally within the memory so that access is faster to the memory units that are most used (or whatever algorithm you choose to use).

What data structure is used to implement the dynamic memory allocation heap?

I always assumed a heap (data structure) is used to implement a heap (dynamic memory allocation), but I've been told I'm wrong.
How are heaps (for example, the one implemented by typical malloc routines, or by Windows's HeapCreate, etc.) implemented, typically? What data structures do they use?
What I'm not asking:
While searching online, I've seen tons of descriptions of how to implement heaps with severe restrictions.
To name a few, I've seen lots of descriptions of how to implement:
Heaps that never release memory back to the OS (!)
Heaps that only give reasonable performance on small, similarly-sized blocks
Heaps that only give reasonable performance for large, contiguous blocks
etc.
And it's funny, they all avoid the harder question:
How are "normal", general-purpose heaps (like the one behind malloc, HeapCreate) implemented?
What data structures (and perhaps algorithms) do they use?

Allocators tend to be quite complex and often differ significantly in how they're implemented.
You can't really describe them in terms of one common data structure or algorithm, but there are some common themes:
Memory is taken from the system in large chunks -- often megabytes at a time.
These chunks are then split up into various smaller chunks as you perform allocations. Not exactly the same size as you allocate, but usually in certain ranges (200-250 bytes, 251-500 bytes, etc.). Sometimes this is multi-tiered, where you'd have an additional layer of "medium chunks" which come before your actual requests.
Controlling which "large chunk" to break a piece off of is a very difficult and important thing to do -- this greatly affects memory fragmentation.
One or more free pools (aka "free list", "memory pool", "lookaside list") are maintained for each of these ranges. Sometimes even thread-local pools. This can greatly speed up a pattern of allocating/deallocating many objects of similar size.
Large allocations are treated a bit differently so as to not waste a lot of RAM and not be pooled quite so much if at all.
If you wanted to check out some source code, jemalloc is a modern high-performance allocator and should be representative in complexity of other common ones. TCMalloc is another common general-purpose allocator, and their website goes into all the gory implementation details. Intel's Thread Building Blocks has an allocator built specifically for high concurrency.
One interesting difference can be seen between Windows and *nix. In *nix, the allocator has very low-level control over the address space an app uses. In Windows, you basically have a course-grained, slow allocator VirtualAlloc to base your own allocator off of.
This results in *nix-compatible allocators typically directly giving you an malloc/free implementation where it's assumed you'll only use one allocator for everything (otherwise they'd trample each-other), while Windows-specific allocators provide additional functions, leaving malloc/free alone, and can be used in harmony (for instance, you can use HeapCreate to make private heaps which can work alongside others).
In practice, this trade in flexibility gives *nix allocators a small leg up performance-wise. It's very rare to see an app intentionally use multiple heaps on Windows -- mostly it's by accident due to different DLLs using different runtimes which each have their own malloc/free, and can cause a lot of headaches if you're not diligent in tracking which heap some memory came from.

Note: The following answer assumes you're using a typical, modern system with virtual memory. The C and C++ standards do not require virtual memory; therefore of course you can't rely on such assumptions on hardware without this feature (e.g. GPUs typically don't have this feature; nor do extremely small hardware like the PIC).
This depends on the platform you're using. Heaps can be very complicated beasts; they don't use only a single data structure; and there is no "standard" data structure. Even where the heap code is located is different depending on the platform. For instance, the heap code is typically provided by the C Runtime on Unix boxes; but is typically provided by the operating system on Windows.
Yes, this is common on Unix machines; due to the way *nix's underlying APIs and memory model operate. Basically, the standard API to return memory to the operating system on these systems only allows returning memory on the "frontier" between where user memory is allocated and the "hole" in between user memory and system facilities like the stack. (The API in question is brk or sbrk). Instead of returning memory to the operating system, many heaps only try to reuse memory no longer in use by the program proper, and don't try to return memory to the system. This is less common on Windows, because its equivalent to sbrk (VirtualAlloc) doesn't have this limitation. (But like sbrk, it is very expensive and has caveats like only allocating page-sized and page-aligned chunks. So heaps try to call either as rarely as possible)
This sounds like a "block allocator", which divides the memory into fixed size chunks, and then just return one of the free chunks. To my (albeit limited) understanding, Windows' RtlHeap maintains a number of data structures like this for different known block sizes. (E.g. it'll have one for blocks of size 16, for instance) RtlHeap calls these "lookaside lists".
I don't really know of a specific structure that handles this case well. Large blocks are problematic for most allocation systems because they cause fragmentation of the address space.
The best reference I've found discussing the common allocation strategies employed on major platforms is the book Secure Coding in C and C++, by Robert Seacord. All of chapter 4 is dedicated to heap data structures (and problems caused when users use said heap systems incorrectly).

A "killer adversary" for memory allocators?

After reading this question about seemingly degenerate behavior for the Windows memory allocator, and remembering back to this paper about constructing worst-case inputs to quicksort implementations, I started wondering: would it be possible to build a program that, given a black-box memory allocator, forces that allocator to fail an allocation request even when sufficient memory is still available in the system? That is, is it possible to take a black-box memory allocator and force it to fail?
I know that this can probably be done by allocating and freeing memory in a checkerboard pattern to force massive fragmentation, so in my mind an ideal solution would cause a failure to occur with the fewest total bytes allocated at the time of failure. With respect to the original post that inspired this, it could in theory be possible to cause a failure with zero bytes allocated if the memory allocator has an internal bug.
Any ideas/thoughts on how to do this?

Depends what you mean by "sufficient memory available". For a simple fragmentation "attack":
Make a squillion small allocations until one fails[*].
Now, sort them in order of address[**].
Free 100 alternate allocations.
Attempt to allocate 100*small bytes.
Chances are the allocator will fail to find contiguous memory to satisfy that. If it has a small page size, and plenty of virtual address space compared with physical memory, then it might be able to rearrange things to do it - but that requires capabilities of the MMU on top of any anti-fragmentation strategy by the allocator.
If by "sufficient available memory" you mean a large block of memory that formerly was a contiguous block, has been split up into several allocations all of which have since been freed, and now the allocator treats it as separate blocks and so fails to allocate large bytes then no, I don't think you can force an arbitrary block-box allocator to fail to coalesce blocks. Some allocator or other might do much more work than Windows appears to be doing in that other question, to guarantee that adjacent free blocks are always coalesced.
[*] possible problem - over-committing memory allocators might not fail, you just get a segfault or your process is killed. On such systems you might need to track how much memory is available.
[**] possible problem - in C and C++, operator< isn't guaranteed to work. But on almost all systems it does, and in C++ there's std::less too.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio