How does the first-fit allocation algorithm reduce memory fragmentation?

I'm reading Chapter 21, Understanding the Garbage Collector, of Real World OCaml.
In the section Memory Allocation Strategies, it says:
First-fit allocation
If your program allocates values of many varied sizes, you may sometimes find that your free list becomes fragmented. In this situation, the GC is forced to perform an expensive compaction despite there being free chunks, since none of the chunks alone are big enough to satisfy the request.
First-fit allocation focuses on reducing memory fragmentation (and hence the number of compactions), but at the expense of slower memory allocation. Every allocation scans the free list from the beginning for a suitable free chunk, instead of reusing the most recent heap chunk as the next-fit allocator does.
I can't figure out how first-fit allocation reduces memory fragmentation compared to next-fit allocation; the only difference between the two algorithms is that they start searching from different places.

I think the short answer is that Next Fit allocates from blocks throughout the whole free memory region, which means that all blocks are slowly whittled down in size. First Fit allocates from as close to the front as possible, so the small allocations concentrate there and the supply of large blocks lasts longer. Since compactions happen when no free block is large enough, First Fit will require fewer compactions.
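To make that concrete, here is a toy, sizes-only free-list model I made up for illustration (no real addresses, no coalescing, and certainly not OCaml's actual runtime). The contrived starting state: ten small 8-word chunks sit at the front of the free list, one 256-word chunk sits at the end, and the next-fit roving pointer happens to be parked on that last chunk. The same stream of small requests then preserves the big chunk under first fit but erodes it under next fit.

// Toy free-list model: chunk sizes only, no addresses, no coalescing.
#include <cstddef>
#include <iostream>
#include <vector>

// Try to carve `request` words out of the free list, scanning from `start`
// and wrapping around. Returns the index of the chunk used, or -1 on failure.
static int carve(std::vector<std::size_t>& freeList, std::size_t request, std::size_t start) {
    for (std::size_t i = 0; i < freeList.size(); ++i) {
        std::size_t j = (start + i) % freeList.size();
        if (freeList[j] >= request) {
            freeList[j] -= request;   // split; the remainder stays on the free list
            return static_cast<int>(j);
        }
    }
    return -1;
}

int main() {
    std::vector<std::size_t> firstFit(10, 8), nextFit(10, 8);
    firstFit.push_back(256);
    nextFit.push_back(256);
    std::size_t roving = nextFit.size() - 1;   // next fit resumes at the chunk it last used

    for (int i = 0; i < 10; ++i) {
        carve(firstFit, 8, 0);                 // first fit always rescans from the front
        int used = carve(nextFit, 8, roving);  // next fit reuses the most recent chunk
        if (used >= 0) roving = static_cast<std::size_t>(used);
    }

    // Now the program asks for one large value.
    std::cout << "first fit, 200-word request: "
              << (carve(firstFit, 200, 0) >= 0 ? "ok" : "FAILS (compaction needed)") << "\n";
    std::cout << "next fit,  200-word request: "
              << (carve(nextFit, 200, roving) >= 0 ? "ok" : "FAILS (compaction needed)") << "\n";
}

The exact numbers don't matter; the mechanism does. First fit keeps recycling the small chunks near the front, so the large chunk at the end stays intact for a later big request, while next fit carves small allocations out of whatever it happens to be sitting on.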

There is a summary of memory allocation policies, and (perhaps) a solution of the memory fragmentation problem for practical programs, in "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.5185&rep=rep1&type=pdf. They point out that most work on this problem has been done by simulating memory allocation and deallocation (a point also made by Knuth in Vol 1, Section 2.5). Their contribution is to move from simulations driven by statistical models and random number generators to simulations driven by traces of the memory allocation behaviour of real programs. Under this regime they find that a variant of best fit tuned for real-life behaviour, which keeps dedicated free lists for commonly used block sizes, does very well.
So I think your answer is that there is no simple, clear answer beyond the results of simulation studies: for common C/C++ programs a variant of best fit can in fact be made to work very well. But if the storage allocation behaviour of OCaml is significantly different from that of C/C++, we will probably only find out what is good and bad when somebody runs tests with different allocators against real programs, or traces of real programs.

Related

Improving my red-black tree implementation

I wrote a red-black tree implementation. Nodes are allocated using malloc. Is it a good idea to allocate a large table at the beginning, use that space to allocate nodes, and double its size each time the table is about to overflow? That would make insert operations somewhat faster, assuming that the time to allocate is significant, which I'm not sure of.
The question of whether it is better to allocate one large block (and split it up on your own) versus allocating lots of small items applies to many situations. And there is not a one-size-fits-all answer for it. In general, though, it would probably be a little bit faster to allocate the large block. But the speedup (if any) may not be large. In my experience, doing the single large allocation typically is worth the effort and complexity in a highly concurrent system that makes heavy use of dynamic allocation. If you have a single-threaded application, my guess is that the allocation of each node makes up a very small cost of the insert operation.
Some general thoughts/comments:
Allocating a single large block (and growing it as needed) will generally use less memory overall. A typical general purpose allocator (e.g., malloc/free in C) has overhead with each allocation. So, for example, a small allocation request of 100 bytes might result in using 128 bytes.
In a memory constrained system with lots of memory fragmentation, it might not be possible to allocate a large block of memory and slice it up whereas multiple small allocations might still succeed.
Although allocating a large block reduces contention for synchronization at the allocator level (e.g., in malloc), it is still necessary to provide your own synchronization when grabbing a node from your own managed list/block (assuming a multi-threaded system). But there likely has to be some synchronization associated with the insert of the node itself anyway, so it could be handled in that same operation.
Ultimately, you would need to test it and measure the difference. One simple thing you could do is just write a simple "throw-away" test that allocates the number of nodes you expect to be handling and just time how long it takes (and then possibly time the freeing of them too). This might give you some kind of ballpark estimate of the allocation costs.
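As a starting point for such a throw-away test, something like the following sketch times one malloc per node against carving nodes out of a single block. The Node layout here is hypothetical, so substitute your real one; and note that growing by realloc would invalidate existing node pointers, so a real pool grows by adding more blocks instead.

// Throw-away benchmark: per-node malloc vs. one big block.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Node {                      // hypothetical stand-in for the real rb-tree node
    Node *left, *right, *parent;
    int key;
    bool red;
};

int main() {
    const std::size_t n = 1'000'000;
    std::vector<Node*> nodes(n);
    using clock = std::chrono::steady_clock;

    // 1) One malloc per node (what the tree does today).
    auto t0 = clock::now();
    for (std::size_t i = 0; i < n; ++i)
        nodes[i] = static_cast<Node*>(std::malloc(sizeof(Node)));
    auto t1 = clock::now();
    for (Node* p : nodes) std::free(p);

    // 2) One big block; "allocation" is just handing out the next slot.
    //    (A real pool would grow by allocating additional blocks, not by
    //     realloc'ing, so existing node pointers stay valid.)
    auto t2 = clock::now();
    Node* block = static_cast<Node*>(std::malloc(n * sizeof(Node)));
    for (std::size_t i = 0; i < n; ++i)
        nodes[i] = &block[i];
    auto t3 = clock::now();
    std::free(block);

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("per-node malloc: %lld ms\n", static_cast<long long>(ms(t1 - t0)));
    std::printf("single block:    %lld ms\n", static_cast<long long>(ms(t3 - t2)));
}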

Why would anyone use best fit memory allocation?

I'm reading Modern Operating Systems by Andrew Tanenbaum, and he writes that best fit is a widely used memory allocation algorithm.
He also writes that it's slower than first fit/next fit, since it has to search the entire free list, and that it tends to waste more memory since it leaves behind a lot of small, useless gaps in memory.
Why is it then widely used? Is there some obvious advantage I have overlooked?
First, it is not that widely used (like all sequential fits), except, perhaps, in homework ;). In my opinion, the widely used strategy is segregated fits (which can very closely approximate best fit).
Second, the best-fit strategy can be implemented using a tree of free lists keyed by size (a minimal sketch follows the references below)
Third, it's considered one of the best policies with regard to memory fragmentation
See
Dynamic Storage Allocation: A Survey and Critical Review
The Memory Fragmentation Problem: Solved?
for information about memory management, not Tanenbaum.
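For the "tree of free lists keyed by size" point above, here is a minimal sketch of the idea (my own illustration, not any particular allocator): with free chunks indexed by size in a balanced tree such as std::multimap, the best (smallest adequate) fit is a single lower_bound lookup in O(log n) instead of a scan of the whole free list. Coalescing and alignment are omitted.

#include <cstddef>
#include <iostream>
#include <map>

struct FreeStore {
    // size -> address of a free chunk; a multimap allows many chunks of the
    // same size and keeps them ordered, so the smallest adequate chunk is
    // found in O(log n).
    std::multimap<std::size_t, std::size_t> bySize;

    void release(std::size_t addr, std::size_t size) {   // no coalescing in this toy
        bySize.emplace(size, addr);
    }

    // Best fit: smallest chunk whose size is >= the request.
    bool acquire(std::size_t size, std::size_t& addr) {
        auto it = bySize.lower_bound(size);
        if (it == bySize.end()) return false;
        addr = it->second;
        std::size_t leftover = it->first - size;
        bySize.erase(it);
        if (leftover > 0)                                  // re-index the remainder by its new size
            bySize.emplace(leftover, addr + size);
        return true;
    }
};

int main() {
    FreeStore fs;
    fs.release(0, 128);
    fs.release(128, 24);
    fs.release(152, 64);

    std::size_t addr;
    if (fs.acquire(20, addr))               // picks the 24-byte chunk, not the 128-byte one
        std::cout << "20-byte request placed at " << addr << "\n";
}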
I think it's a mischaracterisation to say that it wastes more memory than first fit. Best fit maximizes available space compared to first fit, particularly when it comes to conserving space available for large allocations. This blog post gives a good example.
Space efficiency and versatility are really the answer. Large blocks can fit unknown future needs better than small blocks can, so a best-fit algorithm tries to use the smallest adequate blocks first.
First-fit and next-fit algorithms (which can also split blocks) may end up carving pieces out of the larger blocks first, which increases the risk that a later large malloc() will fail. The harm is essentially external fragmentation of the large blocks.
A best-fit algorithm will often find fits that are only a few bytes larger than the request, leaving fragments of only a few bytes, while saving the large blocks for when they're needed. Leaving the large blocks untouched as long as possible also helps cache locality and reduces the load on the MMU, minimizing costly page faults and saving memory pages for other programs.
A good best-fit implementation can maintain its speed even when managing a large number of small fragments, by accepting some internal fragmentation (which is hard to reclaim) and/or by using good lookup tables and search trees.
First-fit and next-fit also face their own searching problems. Without good size indexing, they still have to spend time searching through blocks for one that fits. Since their "standards are lower," they may find a fit faster using a straightforward search, but as soon as you add intelligent indexing, the speeds of all the algorithms become much closer.
The one I've been using and tweaking for the last 6 years can find the best-fit block in O(1) time for >90% of all allocs. It uses a handful of strategies to jump straight to the right block, or to start very close so that searching is minimized. It has, on more than one occasion, replaced existing block-pool or first-fit allocators due to its performance and its ability to pack allocations more efficiently.
Best fit is not the best allocation strategy, but it is better than first fit and next fit. The reason is that it suffers from fewer fragmentation problems than the other two.
Consider a micro heap of 64 bytes. First we fill it by allocating one 32-byte and two 16-byte blocks, in that order. Then we free all three blocks. There are now three free blocks in the heap: one 32-byte block and two 16-byte ones.
Using first fit, we allocate one 16-byte block. We do it using the 32-byte block (because it is first in the heap!), and the remaining 16 bytes of that block are split off into a new free block. So there is one allocated 16-byte block at the beginning of the heap, followed by three free 16-byte blocks.
What happens if we now want to allocate a 32-byte block? We can't! There are still 48 bytes free in the heap, but fragmentation has screwed us over.
What would have happened if we had used best fit? When searching for a free block for our 16-byte allocation, we would have skipped over the 32-byte block at the beginning of the heap and picked the 16-byte block after it instead. That would have preserved the 32-byte block for larger allocations.
I suggest you draw it on paper, that makes it very easy to see what goes on with the heap during allocation and freeing.
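Or, instead of paper, you can replay the same trace in a few lines of code. This is a toy model of that 64-byte heap (chunk sizes only, no addresses or coalescing), made up just to illustrate the scenario described above.

#include <cstddef>
#include <iostream>
#include <vector>

// First fit: the first chunk that is large enough.
static bool firstFit(std::vector<std::size_t>& freeList, std::size_t n) {
    for (std::size_t& c : freeList)
        if (c >= n) { c -= n; return true; }   // split, keep the remainder
    return false;
}

// Best fit: the smallest chunk that is large enough.
static bool bestFit(std::vector<std::size_t>& freeList, std::size_t n) {
    std::size_t* best = nullptr;
    for (std::size_t& c : freeList)
        if (c >= n && (!best || c < *best)) best = &c;
    if (!best) return false;
    *best -= n;
    return true;
}

int main() {
    // After the frees described above, the free list is {32, 16, 16}.
    std::vector<std::size_t> ff = {32, 16, 16}, bf = {32, 16, 16};

    firstFit(ff, 16);   // takes the 32-byte chunk, leaving {16, 16, 16}
    bestFit(bf, 16);    // takes a 16-byte chunk, leaving {32, 0, 16}

    std::cout << "32-byte alloc after first fit: " << (firstFit(ff, 32) ? "ok" : "fails") << "\n";
    std::cout << "32-byte alloc after best fit:  " << (bestFit(bf, 32) ? "ok" : "fails") << "\n";
}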

A "killer adversary" for memory allocators?

After reading this question about seemingly degenerate behavior for the Windows memory allocator, and remembering back to this paper about constructing worst-case inputs to quicksort implementations, I started wondering: would it be possible to build a program that, given a black-box memory allocator, forces that allocator to fail an allocation request even when sufficient memory is still available in the system? That is, is it possible to take a black-box memory allocator and force it to fail?
I know that this can probably be done by allocating and freeing memory in a checkerboard pattern to force massive fragmentation, so in my mind an ideal solution would cause a failure to occur with the fewest total bytes allocated at the time of failure. With respect to the original post that inspired this, it could in theory be possible to cause a failure with zero bytes allocated if the memory allocator has an internal bug.
Any ideas/thoughts on how to do this?
Depends what you mean by "sufficient memory available". For a simple fragmentation "attack" (a code sketch follows the footnotes below):
Make a squillion small allocations until one fails[*].
Now, sort them in order of address[**].
Free 100 alternate allocations.
Attempt to allocate 100*small bytes.
Chances are the allocator will fail to find contiguous memory to satisfy that. If it has a small page size, and plenty of virtual address space compared with physical memory, then it might be able to rearrange things to do it - but that requires capabilities of the MMU on top of any anti-fragmentation strategy by the allocator.
If by "sufficient available memory" you mean a large block of memory that formerly was a contiguous block, has been split up into several allocations all of which have since been freed, and now the allocator treats it as separate blocks and so fails to allocate large bytes then no, I don't think you can force an arbitrary block-box allocator to fail to coalesce blocks. Some allocator or other might do much more work than Windows appears to be doing in that other question, to guarantee that adjacent free blocks are always coalesced.
[*] possible problem - over-committing memory allocators might not fail, you just get a segfault or your process is killed. On such systems you might need to track how much memory is available.
[**] possible problem - in C and C++, operator< isn't guaranteed to work. But on almost all systems it does, and in C++ there's std::less too.
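For what it's worth, a literal C++ rendition of the recipe above might look like the following. Whether it actually defeats a given allocator is very much allocator- and OS-dependent (on a 64-bit system with lazy overcommit the final request will often still succeed), and the cap in step 1 is only there so the demo terminates.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <new>
#include <vector>

int main() {
    const std::size_t small = 64;
    std::vector<char*> blocks;

    // 1. Make a squillion small allocations until one fails.
    //    (Capped so the demo terminates on over-committing systems.)
    for (std::size_t i = 0; i < 10'000'000; ++i) {
        char* p = new (std::nothrow) char[small];
        if (!p) break;
        blocks.push_back(p);
    }

    // 2. Sort them in order of address (std::less is well-defined for pointers).
    std::sort(blocks.begin(), blocks.end(), std::less<char*>());

    // 3. Free alternate blocks to punch small holes everywhere.
    //    (The recipe says 100; freeing all alternates is just more thorough.)
    for (std::size_t i = 0; i + 1 < blocks.size(); i += 2)
        delete[] blocks[i];

    // 4. Ask for a block spanning many of those holes.
    char* big = new (std::nothrow) char[100 * small];
    std::cout << (big ? "large allocation still succeeded\n"
                      : "large allocation failed despite free memory\n");
}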

comparison of access performance of data in heap and stack

It is widely accepted common sense that, for most algorithms, allocating and deallocating data on the stack is much faster than doing so on the heap. In C++, the difference in the code is like
double foo[n*n]
vs.
double* foo = new double[n*n]
But are there any significant differences when it comes to accessing and computing with data that lies either on the heap or on the stack? I.e., is there a speed difference for
foo[i]
The code ought to run on several different architectures, therefore "try it and measure" will not work.
There might be (highly system-dependent) issues of cache locality and read/write misses. If you run your program on both stack and heap data, then it is conceivable (depending on your cache architecture) that you run into more cache misses than if you run it entirely on one contiguous region of the stack. Here is a paper on this issue by Andrew Appel (of SML/NJ) and Zhong Shao, in which they investigate exactly this, since stack vs. heap allocation is a live topic in the implementation of functional languages:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3778
They found some performance problems with write misses but estimated these would be resolved by advances in caching.
So my guess for a contemporary desktop/server machine is that, unless you're running heavily optimized, architecture-specific code which streams data along the cache lines, you won't notice any difference between stack and heap accesses. Things might be different for devices with small caches (like ARM/MIPS controllers), where ignoring the cache can have noticeable performance effects anyway.
Taken as single statements, it doesn't matter.
Little can be said without more context. There are a few effects in favor of the stack which are negligible virtually all of the time.
The stack is likely in the cache already; a freshly allocated heap block likely is not. However, this is a first-execution penalty only. For significant amounts of data, you'd thrash the cache anyway
Stack allocation itself is a bit cheaper than heap allocation, because the allocation is simpler
Long term, the main problem of a heap is usually fragmentation, an "accumulated cost" that (usually) cannot be attributed to single allocations, but may significantly increase the cost of further allocations
Measuring these effects is tricky at least.
Recommendation: performance is not the decider here. Portability and scalability recommend using the heap for all but very small amounts of data.
The stack will be in the CPU cache more often, so that might be faster in some (most?) cases.
But the most precise answer is probably: it depends...
Barring allocation, there should be no discernable difference between accessing data whether it be stack- or heap- based - it's all memory at the end of the day.
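If you can run code on each target after all, a crude way to check the "it depends" answers above is to time the same traversal over a stack array and a heap array; any gap comes from cache behaviour rather than from the access instructions themselves. A minimal sketch (array size chosen to stay well under typical stack limits):

#include <chrono>
#include <cstdio>

volatile double sink;   // keeps the compiler from discarding the sums

template <typename F>
static long long timeIt(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    const int n = 128;                       // 128*128 doubles = 128 KiB, safe on the stack
    const int repeats = 20000;

    double stackArr[n * n];                  // stack-based data
    double* heapArr = new double[n * n];     // heap-based data
    for (int i = 0; i < n * n; ++i) stackArr[i] = heapArr[i] = i * 0.5;

    auto sum = [&](const double* a) {        // identical access pattern for both
        double s = 0;
        for (int r = 0; r < repeats; ++r)
            for (int i = 0; i < n * n; ++i) s += a[i];
        sink = s;
    };

    std::printf("stack: %lld us\n", timeIt([&] { sum(stackArr); }));
    std::printf("heap:  %lld us\n", timeIt([&] { sum(heapArr); }));
    delete[] heapArr;
}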

Private heap or manage memory self

I know we can gain some advantages from creating a private Windows heap, especially for frequently allocated and de-allocated small chunks. But I think the normal approach is to allocate a large block of memory from the default heap and manage the allocations and de-allocations ourselves. My question is: what are the advantages and disadvantages of these two approaches?
Thanks,
Max
Some advantages of managing your own heap:
You might be able to optimize very specifically for your own allocation needs and improve performance.
You may be able to avoid the use of synchronization objects if you know the concurrency rules.
A single free can release an entire set of allocations. For example, a short-lived process that needs a bunch of small allocations that are freed all at once could carve them out of a larger block, which can be freed with a single call later (see the sketch after this answer).
The disadvantages, though, are very big. The added complexity will produce more bugs, more difficult maintenance, and quite possibly poorer performance in the end. I have absolutely no data to support this, but I suspect that home-grown heap management more often hurts performance than helps it.
Some advantages of using the system's allocations (e.g., HeapAlloc):
Less complexity.
Reduced risk of concurrency problems in the allocation/freeing.
The ability to take advantage of the Low-Fragmentation Heap. This already does a very good job in most cases of handling the small allocations very efficiently.
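To illustrate the private-heap side of the question (and the "single free releases an entire set" point above), here is a minimal Win32 sketch: a private heap is created, many small allocations are made from it, and HeapDestroy releases them all in one call. Error handling is minimal and the sizes are arbitrary.

#include <windows.h>

int main() {
    // Private heap: 1 MiB initial size, growable (maximum size 0 = no limit).
    // HEAP_NO_SERIALIZE could be passed instead of 0, but only if the heap is
    // used from a single thread (and it is incompatible with the LFH).
    HANDLE heap = HeapCreate(0, 1 << 20, 0);
    if (!heap) return 1;

    // Opt in to the Low-Fragmentation Heap; on modern Windows this is already
    // the default for serialized heaps, so it is effectively a no-op there.
    ULONG lfh = 2;
    HeapSetInformation(heap, HeapCompatibilityInformation, &lfh, sizeof(lfh));

    // Lots of small allocations for some short-lived data structure...
    for (int i = 0; i < 100000; ++i) {
        void* p = HeapAlloc(heap, 0, 48);
        if (!p) break;
        // ... build nodes, strings, etc.; no individual HeapFree needed here.
    }

    // ...then release all of them at once.
    HeapDestroy(heap);
    return 0;
}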
Allocating larger chunks is commonly done in pool allocators, where the overhead of allocation and deallocation is reduced and locality is increased (as memory is more likely to be consecutive). More information on pool allocators
In many cases, fragmentation is your worst enemy. When you are allocating, keep object sizes consistent or rounded to a small set of size classes (powers of two are popular, but may be too wasteful). This reduces fragmentation, since only a few distinct sizes of memory are ever allocated.

Resources