Why would anyone use best fit memory allocation?

I'm reading Modern Operating Systems by Andrew Tanenbaum, and he writes that best fit is a widely used memory allocation algorithm.
He also writes that it is slower than first fit/next fit, since it has to search the entire free list, and that it tends to waste more memory, since it leaves behind a lot of small, useless gaps in memory.
Why, then, is it widely used? Is there some obvious advantage I have overlooked?

First, it is not that widely used (like all sequential fits), except, perhaps, in homework ;). In my opinion, the widely used strategy is segregated fits (which can very closely approximate best fit).
Second, the best fit strategy can be implemented using a tree of free lists of various sizes; there is a sketch of the idea after the references below.
Third, it is considered one of the best policies with regard to memory fragmentation.
See
Dynamic Storage Allocation: A Survey and Critical Review
The Memory Fragmentation Problem: Solved?
for information about memory management, rather than Tanenbaum.
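To make the segregated-fits idea concrete, here is a minimal sketch in C, assuming power-of-two size classes; every name in it (FreeBlock, seg_fit_alloc, and so on) is invented for illustration, and a real allocator would also split and coalesce blocks:

```c
#include <stddef.h>

/* Minimal sketch of segregated fits: class c holds free blocks whose
 * size is in [2^c, 2^(c+1)). Illustrative only; real allocators such
 * as dlmalloc are far more elaborate. */

#define NUM_CLASSES 32

typedef struct FreeBlock {
    struct FreeBlock *next;
    size_t size;
} FreeBlock;

static FreeBlock *free_lists[NUM_CLASSES];   /* one list per size class */

static int class_of(size_t size)             /* floor(log2(size)) */
{
    int c = 0;
    while (size >>= 1)
        c++;
    return c < NUM_CLASSES ? c : NUM_CLASSES - 1;
}

/* Approximate best fit: blocks in class_of(size) may be too small, so
 * scan that one list; any block in a higher class is guaranteed to
 * fit, so take the first head found. Search cost is near O(1). */
void *seg_fit_alloc(size_t size)
{
    int c0 = class_of(size);
    for (FreeBlock **pp = &free_lists[c0]; *pp; pp = &(*pp)->next) {
        if ((*pp)->size >= size) {
            FreeBlock *b = *pp;
            *pp = b->next;
            return b;
        }
    }
    for (int c = c0 + 1; c < NUM_CLASSES; c++) {
        if (free_lists[c]) {
            FreeBlock *b = free_lists[c];
            free_lists[c] = b->next;
            return b;                 /* a real allocator would split it */
        }
    }
    return NULL;                      /* no fit: grow the heap */
}

/* Freed blocks go back on the list for their class.
 * Assumes size >= sizeof(FreeBlock). */
void seg_fit_free(void *p, size_t size)
{
    FreeBlock *b = p;
    b->size = size;
    int c = class_of(size);
    b->next = free_lists[c];
    free_lists[c] = b;
}
```

Because each list holds only blocks of nearly the same size, taking the head of the first non-empty list approximates best fit without a full scan of the heap.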

I think it's a mischaracterisation to say that it wastes more memory than first fit. Best fit maximizes available space compared to first fit, particularly when it comes to conserving space available for large allocations. This blog post gives a good example.

Space efficiency and versatility are really the answer. Large blocks can fit unknown future needs better than small blocks can, so a best-fit algorithm tries to use the smallest adequate blocks first.
First-fit and next-fit algorithms (which can also split blocks) may end up carving pieces out of the larger blocks first, which increases the risk that a large malloc() will fail; this is the harm done by external fragmentation of the large blocks.
A best-fit algorithm will often find fits that are only a few bytes larger than requested, leading to fragmentation of only a few bytes, while also saving the large blocks for when they're needed. Also, leaving the large blocks untouched as long as possible helps cache locality and minimizes the load on the MMU, minimizing costly page faults and saving memory pages for other programs.
A good best-fit algorithm will maintain its speed even when it's managing a large number of small fragments, by accepting some internal fragmentation (which is hard to reclaim) and/or by using good lookup tables and search trees.
First-fit and next-fit still face their own search problems. Without good size indexing, they still have to spend time searching through blocks for one that fits. Since their "standards are lower," they may find a fit faster with a straightforward search, but as soon as you add intelligent indexing, the speeds of all these algorithms become much closer.
The one I've been using and tweaking for the last 6 years can find the best-fit block in O(1) time for >90% of all allocations. It uses a handful of strategies to jump straight to the right block, or to start very close so that searching is minimized. It has, on more than one occasion, replaced existing block-pool or first-fit allocators due to its performance and its ability to pack allocations more efficiently.

Best fit is not the best allocation strategy, but it is better than first fit and next fit. The reason is that it suffers from fewer fragmentation problems than the other two.
Consider a micro heap of 64 bytes. First we fill it by allocating one 32-byte and two 16-byte blocks, in that order. Then we free all the blocks. There are now three free blocks in the heap: one 32-byte block and two 16-byte ones.
Using first fit, we allocate a 16-byte block. We do it using the 32-byte block (because it is first in the heap!), and the remaining 16 bytes of that block are split off into a new free block. So there is one 16-byte allocated block at the beginning of the heap, followed by three free 16-byte blocks.
What happens if we now want to allocate a 32-byte block? We can't! There are still 48 bytes free in the heap, but fragmentation has screwed us over.
What would have happened if we had used best fit? When searching for a free block for our 16-byte allocation, we would have skipped over the 32-byte block at the beginning of the heap and picked the 16-byte block after it instead. That would have preserved the 32-byte block for larger allocations.
I suggest you draw it on paper; that makes it very easy to see what goes on in the heap during allocation and freeing.
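If you would rather run it than draw it, here is a throwaway C simulation of exactly this 64-byte heap; the Region layout is made up for the example:

```c
#include <stdio.h>

/* Toy model of the 64-byte heap above, after all three blocks have
 * been freed: a 32-byte free block at offset 0, then two 16-byte
 * free blocks. Illustrative only. */

typedef struct { int off, size, free; } Region;

static Region heap[8] = {
    { 0, 32, 1 }, { 32, 16, 1 }, { 48, 16, 1 }
};
static int nregions = 3;

/* Returns the index of the chosen free region, or -1 if none fits. */
static int first_fit(int want)
{
    for (int i = 0; i < nregions; i++)
        if (heap[i].free && heap[i].size >= want)
            return i;
    return -1;
}

static int best_fit(int want)
{
    int best = -1;
    for (int i = 0; i < nregions; i++)
        if (heap[i].free && heap[i].size >= want &&
            (best < 0 || heap[i].size < heap[best].size))
            best = i;
    return best;
}

int main(void)
{
    printf("16-byte alloc: first fit picks the %d-byte block at offset %d\n",
           heap[first_fit(16)].size, heap[first_fit(16)].off);
    printf("16-byte alloc: best fit picks the %d-byte block at offset %d\n",
           heap[best_fit(16)].size, heap[best_fit(16)].off);
    return 0;
}
```

First fit reports the 32-byte block at offset 0; best fit reports the 16-byte block at offset 32, leaving the 32-byte block intact for a later large allocation.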

Related

What is the most important feature of caches in scientific computing?

I have recently started learning parallel programming techniques and what to pay attention to when trying to create efficient programs. For example, knowing specific details about your processor's caches is essential if you want to write efficient code.
I want to know which feature of a cache is more important (if one is more important than the other): the block size, or the associativity (e.g., 4-way versus 8-way)?
Associativity matters more than line size. Many accesses in HPC are sequential, so smaller line size is mostly just a waste of tag overhead.
Having more (and smaller) sets, because of a smaller line size, might help for a histogram problem, which is one of the major access patterns that can't easily be turned into sequential accesses.
Of course, latency and bandwidth are usually even more important than 4 vs. 8-way.
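For concreteness, the parameters are tied together by sets = capacity / (associativity × line size), so at a fixed capacity and associativity, halving the line size doubles the number of sets. A throwaway check with hypothetical cache parameters:

```c
#include <stdio.h>

/* Cache geometry relation: sets = capacity / (ways * line_size).
 * The numbers below are hypothetical, not tied to any specific CPU. */
int main(void)
{
    unsigned capacity = 32 * 1024;   /* 32 KiB cache        */
    unsigned ways     = 8;           /* 8-way set associative */
    unsigned line     = 64;          /* 64-byte cache lines */
    unsigned sets     = capacity / (ways * line);

    printf("%u sets\n", sets);       /* prints: 64 sets */
    return 0;
}
```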

How does the first-fit allocation algorithm reduce memory fragmentation?

I'm reading Chapter 21, Understanding the Garbage Collector, of Real World OCaml.
In the section Memory Allocation Strategies, it says:
First-fit allocation
If your program allocates values of many varied sizes, you may sometimes find that your free list becomes fragmented. In this situation, the GC is forced to perform an expensive compaction despite there being free chunks, since none of the chunks alone are big enough to satisfy the request.
First-fit allocation focuses on reducing memory fragmentation (and hence the number of compactions), but at the expense of slower memory allocation. Every allocation scans the free list from the beginning for a suitable free chunk, instead of reusing the most recent heap chunk as the next-fit allocator does.
I can't figure out how first-fit allocation reduces memory fragmentation compared to next-fit allocation; the only difference between the two algorithms is that they start searching from different places.
Related: What are the first fit, next fit and best fit algorithms for memory management?
I think the short answer is that Next Fit allocates from blocks throughout the whole free memory region, which means that all blocks are slowly reduced in size. First Fit allocates from as close to the front as possible, so the small blocks concentrate there, and thus the supply of large blocks lasts longer. Since compactions happen when no free block is large enough, First Fit will require fewer compactions.
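A minimal sketch of the difference, with an invented FreeChunk list (a real allocator also splits chunks, coalesces on free, and copes with the rover being invalidated): next fit's only change is remembering where the last search ended, which is exactly what spreads the splitting across the whole list.

```c
#include <stddef.h>

typedef struct FreeChunk {
    struct FreeChunk *next;
    size_t size;
} FreeChunk;

static FreeChunk *free_list;   /* singly linked, in address order */
static FreeChunk *rover;       /* where the last next-fit search ended */

/* First fit: always scan from the head, so small leftover chunks pile
 * up near the front while large chunks at the back survive longer. */
FreeChunk *first_fit(size_t size)
{
    for (FreeChunk *c = free_list; c; c = c->next)
        if (c->size >= size)
            return c;
    return NULL;
}

/* Next fit: identical, except it resumes at the rover and wraps. */
FreeChunk *next_fit(size_t size)
{
    if (!free_list)
        return NULL;
    FreeChunk *start = rover ? rover : free_list;
    FreeChunk *c = start;
    do {
        if (c->size >= size) {
            rover = c->next ? c->next : free_list; /* resume after hit */
            return c;
        }
        c = c->next ? c->next : free_list;         /* wrap around */
    } while (c != start);
    return NULL;
}
```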
There is a summary of memory allocation policies, and (perhaps) a solution to the memory fragmentation problem for practical programs, at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.5185&rep=rep1&type=pdf: "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson. They point out that most work on this problem has been done by simulation of memory allocation and deallocation (a point also made by Knuth in Vol. 1, Section 2.5). Their contribution is to move from simulation studies based on statistics and random number generators to simulation studies based on traces of the memory allocation behaviour of real programs. Under this regime, they find that a variant of best fit tuned for real-life behaviour, which uses free lists dedicated to particular block sizes for commonly used sizes, does very well.
So I think your answer is that there is no simple, clear answer apart from the results of simulation studies. For common C/C++ programs, a variant of best fit can in fact be made to work very well. But if the storage allocation behaviour of OCaml is significantly different from that of C/C++, we will probably only find out what is good and bad when somebody runs tests with different allocators on real programs or traces of real programs.

Which of a misaligned store and misaligned load is more expensive?

Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
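For illustration, here is the shift-and-OR trick the question alludes to, as a sketch: one misaligned 64-bit load emulated with two aligned loads. load64_unaligned is a made-up name, and the code assumes a little-endian machine on which both containing aligned words are safely readable:

```c
#include <stdint.h>
#include <stddef.h>

/* Emulate a misaligned 64-bit load from p using two aligned loads
 * plus shift-and-OR. Sketch only: assumes little-endian and that
 * both containing aligned words lie inside a buffer you own.
 * Production code would go through memcpy to respect strict
 * aliasing rules. */
uint64_t load64_unaligned(const uint8_t *p)
{
    uintptr_t addr = (uintptr_t)p;
    size_t off = addr & 7;                  /* misalignment in bytes */
    const uint64_t *base = (const uint64_t *)(addr - off);

    if (off == 0)
        return base[0];                     /* already aligned: 1R */

    uint64_t lo = base[0];                  /* aligned read #1 */
    uint64_t hi = base[1];                  /* aligned read #2 */
    /* Drop the off low bytes of lo, fill the top bytes from hi: 2R total. */
    return (lo >> (8 * off)) | (hi << (8 * (8 - off)));
}
```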
Actually, that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store unaligned 16-byte or 32-byte chunks may you see a small performance degradation.
How much data? Are we talking about two things unaligned at the ends of a large block of data (in the noise), or one item (a word, etc.) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance difference, if any.
Memories, modules, chips, on-die blocks, etc. are usually organized with a fixed access size; at least somewhere along the way there is a fixed access size. Let's just say 64 bits wide, not an uncommon size these days. So at that layer, wherever it is, you can only write or read in aligned 64-bit units.
If you think about a write vs. a read: with a read, you send out an address, that has to go to the memory, and data has to come back; a full round trip has to happen. With a write, everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire-and-forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not yet reached the memory. It does take time, but not as long as a read (not talking about flash/PROMs, just RAM here), since a read requires both paths. So for aligned, full-width accesses a write CAN BE faster; some systems may wait for the data to make it all the way to the memory and then return a completion, which takes perhaps about the same amount of time as the read. It depends on your system, though; the memory technology can make one or the other faster or slower right at the memory itself. Now, the first write after nothing has been happening can do this fire-and-forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path, and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads, but for large movements of data they approach each other.
Now, alignment. The whole memory width will be read on a read; in this case let's say 64 bits. If you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 56 bits are discarded; where depends on the system. Writes that are not a whole, aligned memory width mean that you have to read the width of the memory (say 64 bits), modify the new bits (say 8 bits), then write the whole 64 bits back: a read-modify-write. A read only needs a read; a write needs a read-modify-write, and the farther away from the memory the read-modify-write has to happen, the longer it takes and the slower it is. No matter what, the read-modify-write can't be any faster than the read alone, so the read will be faster. The trimming of bits off the read generally won't take any time, so reading a byte compared to reading 16 or 32 or 64 bits from the same location, so long as the busses and destination are that width all the way, takes the same time from the same location, in general, or should.
Unaligned accesses simply multiply the problem. Say, worst case, you want to read 16 bits such that 8 bits are in one 64-bit location and the other 8 in the next 64-bit location: you need to read 128 bits to satisfy that 16-bit read. How exactly that happens, and how much of a penalty it is, depends on your system. Some busses set up the transfer in X clocks, but the data is one clock per bus width after that, so a 128-bit read might be only one clock longer than the (dozens to hundreds of) clocks it takes to read 64 bits, or in the worst case it could take twice as long to get the 128 bits needed for this 16-bit read. A write is a read-modify-write, so take the read time, then modify the two 64-bit items, then write them back; same deal, it could be X+1 clocks in each direction or could be as bad as 2X clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory; you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64-bit writes, etc. How that happens, though, is that the cache performs same-sized or larger reads. So reading 8 bits may result in one or many 64-bit reads of the slow memory for that first byte; if you perform a second read right after that, of the next byte location, and that location is in the same cache line, then it doesn't go out to slow memory, it reads from the cache, which is much faster; and so on until you cross over into another cache line or other reads cause that cache line to be evicted. If the location being written is in cache, then the read-modify-write happens in the cache; if it is not in cache, then it depends on the system: a write doesn't necessarily mean the read-modify-write causes a cache line fill, it could happen on the back side as if the cache were not there. Now, if you modified one byte in the cache line, that line eventually has to be written back; it simply cannot be discarded, so you have one to a few widths of the memory to write back as a result. Your modification was fast, but eventually the write happens to the slow memory, and that affects the overall performance.
You could have situations where you do a (byte) read, and the cache line, if bigger than the external memory width, makes that read slower than if the cache weren't there; but then you do a byte write to some item in that cache line, and that is fast, since it is in the cache. So you might have experiments that happen to show writes as faster.
A painful case would be reading, say, 16 bits unaligned such that not only do they cross a 64-bit memory width boundary but they also cross a cache line boundary, so that two cache lines have to be read: instead of reading 128 bits, that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks in your computer, for example, are actually multiple memories: maybe eight 8-bit-wide chips to make a 64-bit overall width, or sixteen 4-bit-wide ones, etc. That doesn't mean you can isolate writes to one lane, but maybe; I don't know those modules very well, but there are systems where you can/could do this. Those systems I would consider to be 8 or 4 bits wide as far as the smallest addressable size goes, not 64 bits, for the purposes of this discussion. ECC makes things worse, though. First you need an extra memory chip or more: basically more width, 72 bits to support 64, for example. You must do full writes with ECC, as the whole 72 bits, let's say, has to be self-checking, so you can't do fractions. If there is a correctable (single-bit) error, the read suffers no real penalty; it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want a system to write back that corrected value, but that is not how all systems work, so a read could turn into a read-modify-write, aligned or not. The primary penalty is that if you were able to do fractional writes before, you can't with ECC; they have to be whole-width writes.
Now, back to my earlier question: let's say you use memcpy to move this data. Many C libraries are tuned to do aligned transfers, at least where possible. If the source and destination are unaligned in different ways, that can be bad, and you might want to manage part of the copy yourself. Say they are unaligned in the same way: the memcpy will copy the unaligned bytes first, until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, where it downshifts and copies the last few bytes, if any, in an unaligned fashion. So if the memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends, then yes, it will cost you some extra reads, as much as two extra cache line fills, but that may be in the noise. Even at smaller sizes, even if aligned on, say, 32-bit boundaries, if you are not moving whole cache lines or whole memory widths, there may still be an extra cache line involved; aligned or not, you might only suffer an extra cache line's worth of reading and later writing.
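As a sketch of that head/body/tail pattern (copy_aligned_body is an invented name, and it assumes the source and destination are misaligned by the same amount, as described above):

```c
#include <stddef.h>
#include <stdint.h>

/* Head/body/tail copy: copy byte by byte until dst reaches an 8-byte
 * boundary, stream aligned 64-bit words, then finish the tail.
 * Sketch only: if src and dst were misaligned by the same amount,
 * src is also aligned during the body. A real memcpy would also
 * avoid the aliasing casts below. */
void copy_aligned_body(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* head: bytes until dst is 8-byte aligned */
    while (n && ((uintptr_t)dst & 7)) {
        *dst++ = *src++;
        n--;
    }
    /* body: full 64-bit words, all aligned */
    while (n >= 8) {
        *(uint64_t *)dst = *(const uint64_t *)src;
        dst += 8;
        src += 8;
        n -= 8;
    }
    /* tail: whatever is left */
    while (n--)
        *dst++ = *src++;
}
```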
The pure traditional, non-cached memory view of this, all other things held constant, is as Doug wrote. An unaligned read across one of these boundaries, like the 16 bits across two 64-bit words, costs you an extra read: 2R vs. 1R. A similar write costs you 2R+2W vs. 1W, which is much more expensive. Caches and other things just complicate the problem greatly, making the answer "it depends"... You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt: with any cache, a test can be crafted to show the cache makes things slower, and on the same system a test can be written to show the cache makes things faster.
For further reading, go look at the databooks/datasheets/technical reference manuals, or whatever the vendor calls their docs, for various things. For ARM, get the AXI/AMBA documentation on their busses, and get the documentation for their caches (the PL310, for example). Information on DDR memory, down to the individual chips used in the modules you plug into your computer, is all out there: lots of timing diagrams, etc. (Note that just because you think you are buying gigahertz memory, you are not; DRAM has not gotten faster in something like 10 years or more. It is pretty slow, around 133 MHz; it is just that the bus is faster and can queue more transfers. It still takes hundreds to thousands of processor cycles for a DDR memory cycle; read one byte that misses all the caches and your processor waits an eternity.) So memory interface docs for the processors and docs on various memories may help, along with textbooks on caches in general, etc.

Improving my red-black tree implementation

I wrote a red-black tree implementation. Nodes are allocated using malloc. Would it be a good idea to allocate a large table at the beginning and use that space to allocate nodes, doubling the size each time the table is about to overflow? That would make insert operations somewhat faster, assuming that the time to allocate is significant, which I'm not sure of.
The question of whether it is better to allocate one large block (and split it up on your own) versus allocating lots of small items applies to many situations. And there is not a one-size-fits-all answer for it. In general, though, it would probably be a little bit faster to allocate the large block. But the speedup (if any) may not be large. In my experience, doing the single large allocation typically is worth the effort and complexity in a highly concurrent system that makes heavy use of dynamic allocation. If you have a single-threaded application, my guess is that the allocation of each node makes up a very small cost of the insert operation.
Some general thoughts/comments:
Allocating a single large block (and growing it as needed) will generally use less memory overall. A typical general purpose allocator (e.g., malloc/free in C) has overhead with each allocation. So, for example, a small allocation request of 100 bytes might result in using 128 bytes.
In a memory constrained system with lots of memory fragmentation, it might not be possible to allocate a large block of memory and slice it up whereas multiple small allocations might still succeed.
Although allocating a large block reduces contention for synchronization at the allocator level (e.g., in malloc), it is still necessary to provide your own synchronization when grabbing a node from your own managed list/block (assuming a multi-threaded system). But there likely has to be some synchronization associated with the insert of the node itself anyway, so it could be handled in that same operation.
Ultimately, you would need to test it and measure the difference. One simple thing you could do is just write a simple "throw-away" test that allocates the number of nodes you expect to be handling and just time how long it takes (and then possibly time the freeing of them too). This might give you some kind of ballpark estimate of the allocation costs.
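As a hedged sketch of the large-block approach (every name here, such as Slab and node_alloc, is invented for illustration): note that growing a single table with realloc would move it and invalidate every parent/child pointer already in the tree, so this sketch chains fixed slabs instead, doubling the size of each new one.

```c
#include <stdlib.h>

/* Growing node pool for tree nodes. Freed nodes are recycled through
 * an internal free list (linked via ->left); slabs are never freed
 * individually. Sketch only, not thread-safe. */

typedef struct Node {
    struct Node *left, *right, *parent;
    int color;
    int key;
} Node;

typedef struct Slab { struct Slab *next; } Slab;

static Slab  *slabs;           /* chain of allocated blocks           */
static Node  *free_nodes;      /* recycled nodes, linked via ->left   */
static Node  *fresh;           /* next never-used node in newest slab */
static size_t fresh_left;      /* unused nodes remaining in it        */
static size_t slab_cap = 64;   /* next slab size; doubles each grow   */

Node *node_alloc(void)
{
    if (free_nodes) {                      /* reuse a freed node: O(1) */
        Node *n = free_nodes;
        free_nodes = n->left;
        return n;
    }
    if (fresh_left == 0) {                 /* grow: one big malloc */
        Slab *s = malloc(sizeof(Slab) + slab_cap * sizeof(Node));
        if (!s)
            return NULL;
        s->next = slabs;
        slabs = s;
        fresh = (Node *)(s + 1);           /* nodes follow the header */
        fresh_left = slab_cap;
        slab_cap *= 2;                     /* double next time */
    }
    fresh_left--;
    return fresh++;
}

void node_free(Node *n)                    /* O(1), no call to free() */
{
    n->left = free_nodes;
    free_nodes = n;
}
```

Timing insertions with this pool against plain malloc/free, as suggested above, would show whether per-node allocation is actually a significant cost in your workload.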

Best heuristic for malloc

Consider using malloc() to allocate x bytes of memory in a fragmented heap. Assume the heap has multiple contiguous locations of size greater than x bytes.
Which of the following heuristics for choosing a location is best (leads to the least heap wastage)?
Select smallest location that is bigger than x bytes.
Select largest location that is bigger than x bytes.
My intuition is smallest location that is bigger than x bytes. I am not sure which is the best in practice.
No, this is not an assignment question. I was reading How do malloc() and free() work? and this looked like a good follow-up question to ask.
In a generic heap where allocations of different sizes are mixed, then of the two I'd go for putting the allocation in the smallest block that can accommodate it (to avoid reducing the size of the largest block we can allocate before we need to).
There are, however, other ways of implementing a heap that would make this question less relevant (such as the popular dlmalloc by Doug Lea, which pools blocks of similar sizes to improve speed and reduce overall fragmentation).
Which solution is best always comes down to the way the application performs its memory allocations. If you know an application's allocation pattern in advance, you should be able to beat the generic heaps in both size and speed.
It's better to select the smallest location. Think about future malloc requests. You don't know what they'll be, and you want to satisfy as many requests as you can. So it's better to find a location that exactly fits your needs, so that bigger requests can be satisfied in the future. In other words, selecting the smallest location reduces fragmentation.
The heuristics you listed are used in the Best Fit and Worst Fit algorithms, respectively. There is also the First Fit algorithm which simply takes the first space it finds that is large enough. It is approximately as good as Best Fit, and much faster.
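For reference, the two heuristics in the question differ by a single comparison; this sketch over a plain array of free-extent sizes is illustrative only:

```c
#include <stddef.h>

/* Returns the index of the chosen free extent, or -1 if none fits. */

int best_fit_idx(const size_t *free_sizes, int n, size_t x)
{
    int pick = -1;
    for (int i = 0; i < n; i++)
        if (free_sizes[i] >= x &&
            (pick < 0 || free_sizes[i] < free_sizes[pick]))
            pick = i;        /* smallest extent that still fits */
    return pick;
}

int worst_fit_idx(const size_t *free_sizes, int n, size_t x)
{
    int pick = -1;
    for (int i = 0; i < n; i++)
        if (free_sizes[i] >= x &&
            (pick < 0 || free_sizes[i] > free_sizes[pick]))
            pick = i;        /* largest extent: leaves a big remainder */
    return pick;
}
```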
