Private heap or manage memory ourselves - winapi

I know we can gain some advantages by creating a private Windows heap, especially for small chunks that are frequently allocated and de-allocated. But I think the usual approach is to allocate one large block from the default heap and manage the allocations and de-allocations ourselves. My question is: what are the advantages and disadvantages of each approach?
Thanks,
Max

Some advantages of managing your own heap:
You might be able to optimize very specifically for your own allocation needs and improve performance.
You may be able to avoid the use of synchronization objects if you know the concurrency rules.
A single free can release an entire set of allocations. For example, a short lived process that needs a bunch of small allocations that are freed all at once could carve them out of a larger block, which can be freed with a single call later.
The disadvantages, though, are very big. The added complexity will produce more bugs, more difficult maintenance, and quite possibly poorer performance in the end. I have absolutely no data to support this, but I suspect that more home-grown heap management systems hurt performance than help it.
Some advantages of using the system's allocations (e.g., HeapAlloc):
Less complexity.
Reduced risk of concurrency problems in the allocation/freeing.
The ability to take advantage of the Low-Fragmentation Heap. This already does a very good job in most cases of handling the small allocations very efficiently.
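The "single free releases an entire set of allocations" advantage listed above can be sketched as a bump (arena) allocator. This is only an illustrative sketch: the `Arena` class and its member names are hypothetical, it is not thread-safe, and it assumes requested alignments do not exceed what `malloc` guarantees for the underlying block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical bump ("arena") allocator: carve small allocations out of one
// large block and release them all with a single free.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(base_ ? capacity : 0) {}
    ~Arena() { std::free(base_); }  // one call releases every allocation
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns nullptr when the arena is exhausted.
    // `align` must be a power of two.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        used_ = start + size;
        return base_ + start;
    }

    void reset() { used_ = 0; }  // reuse the block without returning it
    std::size_t used() const { return used_; }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

There is no per-allocation free at all here, which is what makes allocation this cheap. It is also why the approach only suits groups of objects that share a lifetime.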

Allocating larger chunks is commonly done in pool allocators, where the overhead of allocation and deallocation is reduced and locality is increased (as memory is more likely to be consecutive).
In many cases, fragmentation is your worst enemy. When you are allocating, keep object sizes consistent or restrict them to a few size classes (powers of two are popular, but may be too wasteful). This reduces fragmentation because only a few common block sizes are ever allocated.
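To make the "power of two may be too wasteful" remark concrete, here is a small sketch that rounds a request up to a power-of-two size class and reports the bytes lost inside the block. The function names and the 8-byte minimum class are assumptions chosen for illustration.

```cpp
#include <cstddef>

// Round a request up to the next power-of-two size class (minimum 8 bytes).
// Assumes `n` is small enough that the result does not overflow.
inline std::size_t size_class(std::size_t n) {
    std::size_t c = 8;
    while (c < n) c <<= 1;
    return c;
}

// Bytes wasted inside the block (internal fragmentation).
inline std::size_t wasted(std::size_t n) { return size_class(n) - n; }
```

A 100-byte request lands in the 128-byte class and wastes 28 bytes. A 129-byte request lands in the 256-byte class and wastes 127 bytes, which is nearly half the block. Real allocators often add intermediate classes to reduce this.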

Related

How does the first-fit allocation algorithm reduce memory fragmentation?

I'm reading the Chapter 21 Understanding the Garbage Collector of Real World OCaml.
In the section Memory Allocation Strategies, it says:
First-fit allocation
If your program allocates values of many varied sizes, you may sometimes find that your free list becomes fragmented. In this situation, the GC is forced to perform an expensive compaction despite there being free chunks, since none of the chunks alone are big enough to satisfy the request.
First-fit allocation focuses on reducing memory fragmentation (and hence the number of compactions), but at the expense of slower memory allocation. Every allocation scans the free list from the beginning for a suitable free chunk, instead of reusing the most recent heap chunk as the next-fit allocator does.
I can't figure out how first-fit allocation reduces memory fragmentation compared to next-fit allocation; the only difference between these two algorithms is that they start searching from a different place.
I think the short answer is that Next Fit allocates from blocks throughout the whole free memory region, which means that all blocks are slowly reduced in size. First Fit allocates from as close to the front as possible, so the small blocks concentrate there. Thus the supply of large blocks lasts longer. Since compactions happen where no free block is large enough, First Fit will require fewer compactions.
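The argument above can be tried out on a toy free list. This is a sketch under simplifying assumptions, not OCaml's actual allocator: nothing is ever freed, each entry only tracks the bytes left in one free block, and next fit is modelled as a roving pointer that resumes just after the block it used last. `FreeList` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Toy free list: each entry is the number of bytes left in one free block.
struct FreeList {
    std::vector<std::size_t> blocks;
    std::size_t rover = 0;  // where next fit resumes its search

    // First fit: always scan from the start of the list.
    bool first_fit(std::size_t n) {
        for (std::size_t& b : blocks) {
            if (b >= n) { b -= n; return true; }
        }
        return false;
    }

    // Next fit (roving-pointer variant, an assumption of this sketch):
    // resume scanning just after the block used last time.
    bool next_fit(std::size_t n) {
        for (std::size_t k = 0; k < blocks.size(); ++k) {
            std::size_t i = (rover + k) % blocks.size();
            if (blocks[i] >= n) {
                blocks[i] -= n;
                rover = (i + 1) % blocks.size();
                return true;
            }
        }
        return false;
    }
};
```

Starting from four 100-byte blocks and serving eight 10-byte requests, first fit leaves 20, 100, 100, 100 while this next-fit variant leaves 80, 80, 80, 80. A later 90-byte request then succeeds only under first fit, and that is the case in which a compaction would otherwise be forced.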
There is a summary of memory allocation policies and (perhaps) a solution of the memory fragmentation problem for practical programs at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.5185&rep=rep1&type=pdf "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson. They point out that most work on this problem has been by simulation of memory allocation and deallocation (a point also made by Knuth in Vol 1 Section 2.5). Their contribution is to move from simulation studies based on statistical studies and random number generators to simulation studies based on traces of the memory allocation behaviour of real programs. Under this regime, they find that a variant of best fit tuned for real life behaviour, which uses free lists dedicated to particular memory block sizes for commonly used block sizes, does very well.
So I think your answer is that there is no simple clear answer except for the results of simulation studies, that for common C/C++ programs a variant of best fit can in fact be made to work very well - but if the storage allocation behaviour of OCaml is significantly different from that of C/C++ it is likely that we will only really find out about what is good and bad when somebody runs tests with different allocators using real programs or traces of real programs.

Why shouldn't we use dynamically allocated memory of varying sizes in embedded systems?

I have heard that in embedded systems we should use preallocated fixed-size memory chunks (like a buddy memory system?). Could somebody give me a detailed explanation why?
Thanks,
In embedded systems you have very limited memory. Therefore, if you occasionally lose even one byte of memory (because you allocate it but don't free it), this can eat up the system memory fairly quickly. With 1 GB of RAM, a leak of one byte per hour will take a long time to matter; with 4 kB of RAM, not nearly as long.
Essentially, the point of avoiding dynamic memory is to avoid the effects of bugs in your program. Static memory allocation is fully deterministic (while dynamic memory allocation is not), so using only static allocation counteracts such bugs. One important factor is that embedded systems are often used in safety-critical applications: a few hours of downtime could cost millions, or an accident could happen.
Furthermore, depending on the dynamic memory allocator, an allocation might take an indeterminate amount of time, which can lead to more bugs, especially in systems relying on tight timing (thanks to Clifford for mentioning this). This type of bug is often hard to test and to reproduce because it relies on a very specific execution path.
Additionally, embedded systems often don't have MMUs, so there is no memory protection. If you run out of memory and your code to handle that condition doesn't work, you could end up executing arbitrary memory as instructions (bad things could happen! However, this case is only indirectly related to dynamic memory allocation).
As Hao Shen mentioned, fragmentation is also a danger. Whether it occurs depends on your exact use case, but in embedded systems it is quite easy to lose 50% of your RAM to fragmentation. You can only avoid fragmentation if you allocate chunks that always have the exact same size.
Performance also plays a role (depending on the use case - thanks Hao Shen). Statically allocated memory is laid out by the compiler, whereas malloc() and similar need to run on the device and therefore consume CPU time (and power).
Many embedded OSs (e.g. ChibiOS) support some kind of dynamic memory allocator. But using it only increases the possibility of unexpected issues to occur.
Note that these arguments are often circumvented by using smaller statically allocated memory pools. This is not a real solution, as one can still run out of memory in those pools, but it will only affect a small part of the system.
As pointed out by Stephano Sanfilippo, some systems don't even have enough resources to support dynamic memory allocation.
Note: Most coding standards, including the JPL coding standard and DO-178B (for critical avionics code - thanks Stephano Sanfilippo), forbid the use of malloc.
I also assume the MISRA C standard forbids malloc() because of this forum post -- however I don't have access to the standard itself.
The main reasons not to use dynamic heap memory allocation here are basically:
a) Determinism and, correlated,
b) Memory fragmentation.
Memory leaks are usually not a problem in those small embedded applications, because they will be detected very early in development/testing.
Memory fragmentation can however become non-deterministic, causing (best case) out-of-memory errors at random times and points in the application in the field.
It may also be non-trivial to predict the actual maximum memory usage of the application during development with dynamic allocation, whereas the amount of statically allocated memory is known at compile time and it is absolutely trivial to check if that memory can be provided by the hardware or not.
Allocating memory from a pool of fixed size chunks has a couple advantages over dynamic memory allocation. It prevents heap fragmentation and it is more deterministic.
With dynamic memory allocation, dynamically sized memory chunks are allocated from a fixed size heap. The allocations aren't necessarily freed in the same order that they're allocated. Over time this can lead to a situation where the free portions of the heap are divided up between allocated portions of the heap. As this fragmentation occurs, it can become more difficult to fulfill requests for larger allocations of memory. If a request for a large memory allocation is made, and there is no contiguous free section in the heap that's large enough then the allocation will fail. The heap may have enough total free memory but if it's all fragmented and there is not a contiguous section then the allocation will fail. The possibility of malloc() failing due to heap fragmentation is undesirable in embedded systems.
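The situation described above, where enough total free memory exists but no contiguous run is large enough, can be shown with a tiny model of a heap. This is an illustrative sketch: the heap is reduced to one flag per equal-sized unit, and `HeapStats` and `measure` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Heap model: one flag per equal-sized unit, true = in use.
struct HeapStats {
    std::size_t total_free;   // free units anywhere in the heap
    std::size_t largest_run;  // longest contiguous run of free units
};

inline HeapStats measure(const std::vector<bool>& in_use) {
    HeapStats s{0, 0};
    std::size_t run = 0;
    for (bool used : in_use) {
        if (used) { run = 0; continue; }
        ++s.total_free;
        ++run;
        s.largest_run = std::max(s.largest_run, run);
    }
    return s;
}
```

If eight units are allocated and every other one is then freed, four units are free in total but the largest contiguous run is a single unit, so a two-unit request fails even though half the heap is free.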
One way to combat fragmentation is to rejoin the smaller memory allocations into larger contiguous sections as they are freed. This can be done in various ways, but they all take time and can make the system less deterministic. For example, if the memory manager scans the heap when a memory allocation is freed, then the amount of time it takes free() to complete can vary depending on what types of memory are adjacent to the allocation being freed. That is non-deterministic and undesirable in many embedded systems.
Allocating from a pool of fixed-size chunks does not cause fragmentation. As long as there are some free chunks, an allocation won't fail, because every chunk is the right size. Allocating and freeing from a pool of fixed-size chunks is also simpler, so the allocate and free functions can be written to be deterministic.
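A minimal sketch of such a fixed-size pool follows, with constant-time allocate and release and no calls into a general-purpose heap. `FixedPool` and its members are hypothetical names, and the sketch assumes single-threaded use and that callers only release pointers obtained from the same pool.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical fixed-size block pool. Free blocks are chained through
// their own storage, so no extra bookkeeping memory is needed.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "a free block must hold a link");
    static_assert(BlockSize % alignof(void*) == 0, "links must stay aligned");

public:
    FixedPool() {
        // Thread every block onto the free list once, at start-up.
        for (std::size_t i = 0; i < BlockCount; ++i)
            push(storage_ + i * BlockSize);
    }

    // O(1): pop the head of the free list; nullptr when the pool is empty.
    void* allocate() {
        if (!free_) return nullptr;
        Node* n = free_;
        free_ = n->next;
        --available_;
        return n;
    }

    // O(1): push the block back. `p` must have come from allocate().
    void release(void* p) {
        assert(p != nullptr);
        push(p);
    }

    std::size_t available() const { return available_; }

private:
    struct Node { Node* next; };

    void push(void* p) {
        free_ = new (p) Node{free_};
        ++available_;
    }

    alignas(std::max_align_t) unsigned char storage_[BlockSize * BlockCount];
    Node* free_ = nullptr;
    std::size_t available_ = 0;
};
```

Because every operation is a fixed handful of pointer updates, the worst-case time is easy to bound. The pool can still run dry, but the failure is confined to that pool and its total size is known at compile time.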

Improving my redblack tree implementation

I wrote an rb-tree implementation. Nodes are allocated using malloc. Is it a good idea to allocate a large table at the beginning, use that space to allocate nodes, and double the size each time the table is about to overflow? That would make insert operations somewhat faster, assuming that the time to allocate is significant, which I'm not sure of.
The question of whether it is better to allocate one large block (and split it up on your own) versus allocating lots of small items applies to many situations. And there is not a one-size-fits-all answer for it. In general, though, it would probably be a little bit faster to allocate the large block. But the speedup (if any) may not be large. In my experience, doing the single large allocation typically is worth the effort and complexity in a highly concurrent system that makes heavy use of dynamic allocation. If you have a single-threaded application, my guess is that the allocation of each node makes up a very small cost of the insert operation.
Some general thoughts/comments:
Allocating a single large block (and growing it as needed) will generally use less memory overall. A typical general purpose allocator (e.g., malloc/free in C) has overhead with each allocation. So, for example, a small allocation request of 100 bytes might result in using 128 bytes.
In a memory constrained system with lots of memory fragmentation, it might not be possible to allocate a large block of memory and slice it up whereas multiple small allocations might still succeed.
Although allocating a large block reduces contention for synchronization at the allocator level (e.g., in malloc), it is still necessary to provide your own synchronization when grabbing a node from your own managed list/block (assuming a multi-threaded system). But then there likely has to be some synchronization associated with the insert of the node itself, so it could be handled in that same operation.
Ultimately, you would need to test it and measure the difference. One simple thing you could do is just write a simple "throw-away" test that allocates the number of nodes you expect to be handling and just time how long it takes (and then possibly time the freeing of them too). This might give you some kind of ballpark estimate of the allocation costs.
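A throw-away test of the kind suggested above might look like the following sketch. The `Node` layout and the function names are assumptions made for illustration, and an optimizing compiler may remove some of this work, so treat the numbers only as a rough ballpark.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Stand-in for a red-black tree node (layout is an assumption).
struct Node { Node* left; Node* right; Node* parent; int key; bool red; };

// Time `count` individual malloc/free pairs; returns elapsed microseconds.
inline long long time_individual(std::size_t count) {
    std::vector<Node*> nodes(count);
    auto t0 = std::chrono::steady_clock::now();
    for (auto& n : nodes) {
        n = static_cast<Node*>(std::malloc(sizeof(Node)));
        if (n) n->key = 0;  // touch the memory so it is really used
    }
    for (auto* n : nodes) std::free(n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// Time one block allocation that holds `count` nodes.
inline long long time_block(std::size_t count) {
    auto t0 = std::chrono::steady_clock::now();
    Node* block = static_cast<Node*>(std::malloc(count * sizeof(Node)));
    if (block) {
        for (std::size_t i = 0; i < count; ++i) block[i].key = 0;
    }
    std::free(block);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Calling both with the node count you expect, a few times each, and comparing the results against the time of a full insert gives a first idea of whether allocation is worth optimizing at all.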

comparison of access performance of data in heap and stack

It is widely known that, for most algorithms, allocating and deallocating data on the stack is much faster than doing so on the heap. In C++, the difference in the code is like
double foo[n*n];
vs.
double* foo = new double[n*n];
But are there any significant differences when it comes to accessing and calculating with data that lies on the heap versus the stack? I.e., is there a speed difference for
foo[i]
The code is supposed to run on several different architectures, so simply trying it and measuring will not work.
There might be (highly system-dependent) issues with cache locality and read/write misses. If your program works on both stack and heap data, then it is conceivable (depending on your cache architecture) that you run into more cache misses than if it works entirely on one contiguous region of the stack. Here is a paper on this issue by Andrew Appel (of SML/NJ) and Zhong Shao, in which they investigate exactly this, because stack/heap allocation is a topic for the implementation of functional languages:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3778
They found some performance problems with write misses but estimated these would be resolved by advances in caching.
So my guess for a contemporary desktop/server machine is that, unless you're running heavily optimized, architecture-specific code which streams data along the cache lines, you won't notice any difference between stack and heap accesses. Things might be different for devices with small caches (like ARM/MIPS controllers), where ignoring the cache can have noticeable performance effects anyway.
Taken as single statements, it doesn't matter.
Little can be said without more context. There are a few effects in favor of the stack which are negligible virtually all of the time.
The stack is likely in the cache already; a freshly allocated heap block likely is not. However, this is a first-execution penalty only. For significant amounts of data, you'd thrash the cache anyway.
Stack allocation itself is a bit cheaper than heap allocation, because the allocation is simpler.
Long term, the main problem of a heap is usually fragmentation, an "accumulated cost" that (usually) cannot be attributed to single allocations, but may significantly increase the cost of further allocations.
Measuring these effects is tricky at best.
Recommendation: performance is not the decider here. Portability and scalability recommend using the heap for all but very small amounts of data.
The stack will be in the CPU cache more often, so that might be faster in some (most?) cases.
But the most precise answer is probably: it depends...
Barring allocation, there should be no discernable difference between accessing data whether it be stack- or heap- based - it's all memory at the end of the day.

Memory Allocation/Deallocation Bottleneck?

How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?
Note: I'm not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn't necessarily.
Edit: Although I mention malloc, this question is not intended to be C/C++ specific.
It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (eg, they ask the OS for memory 16MB at a time and then parcel it out in fixed blocks of 4kb, 16kb, etc) to avoid this issue.
In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.
Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.
In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention, locks are far from free and should be avoided as much as possible.
Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.
First off, since you said malloc, I assume you're talking about C or C++.
Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on "under the hood" when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized--there's no platform-independent way to know what the impact will be. Some systems (like a lot of game consoles) also don't do memory defragmentation, so on those systems, you'll start to get out-of-memory errors as memory becomes fragmented.
A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
No two ways about it, though--if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.
In general the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. In general, I'd say this is probably not in the top-10 of performance issues I'd worry about.
Now, grabbing very large chunks of memory might be an issue. And grabbing but not properly getting rid of memory is something I'd worry about.
In Java and JVM-based languages, new'ing objects is now very, very, very fast.
Here's one decent article by a guy who knows his stuff with some references at the bottom to more related links:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html
A Java VM will claim and release memory from the operating system pretty much independently of what the application code is doing. This allows it to grab and release memory in large chunks, which is hugely more efficient than doing it in tiny individual operations, as you get with manual memory management.
This article was written in 2005, and JVM-style memory management was already streets ahead. The situation has only improved since then.
Which language boasts faster raw allocation performance, the Java language, or C/C++? The answer may surprise you -- allocation in modern JVMs is far faster than the best performing malloc implementations. The common code path for new Object() in HotSpot 1.4.2 and later is approximately 10 machine instructions (data provided by Sun; see Resources), whereas the best performing malloc implementations in C require on average between 60 and 100 instructions per call (Detlefs, et. al.; see Resources). And allocation performance is not a trivial component of overall performance -- benchmarks show that many real-world C and C++ programs, such as Perl and Ghostscript, spend 20 to 30 percent of their total execution time in malloc and free -- far more than the allocation and garbage collection overhead of a healthy Java application.
In Java (and potentially other languages with a decent GC implementation), allocating an object is very cheap. In the Sun JVM it needs only about 10 machine instructions. A malloc in C/C++ is much more expensive, simply because it has to do more work.
Even though allocating objects in Java is very cheap, doing so for a lot of users of a web application in parallel can still lead to performance problems, because more garbage collector runs will be triggered.
Therefore there are those indirect costs of an allocation in Java caused by the deallocation done by the GC. These costs are difficult to quantify because they depend very much on your setup (how much memory do you have) and your application.
Allocating and releasing memory in terms of performance are relatively costly operations. The calls in modern operating systems have to go all the way down to the kernel so that the operating system is able to deal with virtual memory, paging/mapping, execution protection etc.
On the other side, almost all modern programming languages hide these operations behind "allocators" which work with pre-allocated buffers.
This concept is also used by most applications which have a focus on throughput.
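In C++17 the standard library exposes this "allocator over a pre-allocated buffer" idea directly through `std::pmr`. The sketch below is one possible use, not a recommendation for any particular program. The helper name `fill_from_buffer` is hypothetical, and it assumes a standard library that ships `<memory_resource>`.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Fill a vector whose storage comes from a caller-supplied buffer.
// Returns true if the vector's elements ended up inside that buffer.
inline bool fill_from_buffer(std::byte* buffer, std::size_t size, int count) {
    // null_memory_resource() as upstream: throw std::bad_alloc instead of
    // silently falling back to the global heap when the buffer runs out.
    std::pmr::monotonic_buffer_resource pool(
        buffer, size, std::pmr::null_memory_resource());

    std::pmr::vector<int> v(&pool);
    v.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) v.push_back(i);

    auto* first = reinterpret_cast<std::byte*>(v.data());
    return first >= buffer && first < buffer + size;
}
```

The monotonic resource never reuses memory until it is destroyed, so it behaves like the up-front block described above: cheap to allocate from and released all at once.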
I know I answered earlier; however, that was an answer to the other answers, not to your question.
To speak to you directly: if I understand correctly, your performance criterion is throughput.
To me, this means that you should be looking almost exclusively at NUMA-aware allocators.
None of the earlier references (the IBM JVM paper, MicroQuill C, Sun JVM) cover this point, so I am highly suspicious of their applicability today, where, at least on the AMD ABI, NUMA is the pre-eminent memory-CPU governor.
Hands down; real world, fake world, whatever world... NUMA-aware memory request/use technologies are faster. Unfortunately, I'm running Windows currently, and I have not found an equivalent of the "numastat" tool that is available on Linux.
A friend of mine has written about this in depth in his implementation for the FreeBSD kernel.
Although I can show, ad hoc, the typically VERY large number of local-node memory requests relative to remote-node requests (underscoring the obvious throughput advantage), you can surely benchmark this yourself, and that is likely what you need to do, as your performance characteristics are going to be highly specific.
I do know that in a lot of ways, at least the earlier 5.x VMware fared rather poorly, at that time at least, for not taking advantage of NUMA, frequently demanding pages from the remote node. However, VMs are a very unique beast when it comes to memory compartmentalization or containerization.
One of the references I cited is to Microsoft's API implementation for the AMD ABI, which has NUMA-specialized allocation interfaces for user-land application developers to exploit ;)
Here's a fairly recent analysis, visuals and all, from some browser add-on developers who compare 4 different heap implementations. Naturally, the one they developed comes out on top (odd how the people who do the testing often get the highest scores).
They do quantify in some ways, at least for their use case, what the exact trade-off is between space and time. Generally, they found that the LFH (oh, and by the way, the LFH is apparently simply a mode of the standard heap) or a similarly designed approach consumes significantly more memory off the bat but, over time, may wind up using less memory... the graphics are neat too...
I would think, however, that selecting a heap implementation based on your typical workload, after you understand it well ;), is a good idea; but to understand your needs well, first make sure your basic operations are correct before you optimize these odds and ends ;)
This is where C/C++'s memory allocation system works the best. The default allocation strategy is OK for most cases but it can be changed to suit whatever is needed. In GC systems there's not a lot you can do to change allocation strategies. Of course, there is a price to pay, and that's the need to track allocations and free them correctly. C++ takes this further and the allocation strategy can be specified per class using the new operator:
#include <cstddef> // std::size_t

class AClass
{
public:
    void *operator new (std::size_t size);    // called whenever there's a new AClass
    void *operator new [] (std::size_t size); // called whenever there's a new AClass []
    void operator delete (void *memory);      // if you define new, you really need to define delete as well
    void operator delete [] (void *memory);   // likewise for the array form
};
Many of the STL templates allow you to define custom allocators as well.
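As an illustration of the custom-allocator hook in the standard containers, here is a minimal C++11-style allocator that forwards to malloc/free and counts calls. `CountingAllocator` and `allocation_count` are hypothetical names, and the sketch ignores over-aligned types and thread safety.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical counter shared by all CountingAllocator instantiations.
inline std::size_t& allocation_count() {
    static std::size_t n = 0;
    return n;
}

template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++allocation_count();
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
        throw std::bad_alloc();
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless allocators always compare equal.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return false;
}
```

Used as `std::vector<int, CountingAllocator<int>>`, it shows how often a container really reaches the allocator. A `reserve` on an empty vector is one call, and later `push_back` calls within that capacity add none. That is a cheap way to gather evidence before writing a more elaborate allocator.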
As with all things to do with optimisation, you must first determine, through run time analysis, if memory allocation really is the bottleneck before writing your own allocators.
According to MicroQuill SmartHeap Technical Specification, "a typical application [...] spends 40% of its total execution time on managing memory". You can take this figure as an upper bound; I personally feel that a typical application spends more like 10-15% of execution time allocating/deallocating memory. It is rarely a bottleneck in a single-threaded application.
In multithreaded C/C++ applications standard allocators become an issue due to lock contention. This is where you start to look for more scalable solutions. But keep in mind Amdahl's Law.
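The reminder about Amdahl's Law can be turned into a quick calculation. The function below is a plain transcription of the usual formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time spent in the part being improved and s is how much faster that part becomes. The function name is an assumption.

```cpp
#include <cassert>

// Amdahl's Law: overall speedup when a fraction `p` of the run time
// is made `s` times faster and the rest is unchanged.
inline double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With 15% of the time spent in allocation, an allocator that is twice as fast gives an overall speedup of only about 1.08x, and even an infinitely fast one cannot exceed roughly 1.18x. That bounds what a replacement allocator can buy before any code is written.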
Pretty much all of you are off base if you are talking about the Microsoft heap. Synchronization is effortlessly handled, as is fragmentation.
The current preferred heap is the LFH (Low-Fragmentation Heap); it is the default in Vista and later and can be configured on XP, via gflags, without much trouble.
It is easy to avoid any locking/blocking/contention/bus-bandwidth issues and the lot with the
HEAP_NO_SERIALIZE
option during HeapAlloc or HeapCreate. This will allow you to create/use a heap without entering into an interlocked wait, provided that only one thread uses that heap at a time or you synchronize access yourself.
I would recommend creating several heaps with HeapCreate and defining a macro, perhaps mallocx(enum my_heaps_set, size_t);
That would be fine; of course, you need realloc and free to be set up as appropriate as well. If you want to get fancy, make free/realloc auto-detect which heap handle to use on their own by evaluating the address of the pointer, or even add some logic to allow malloc to identify which heap to use based on its thread id, building a hierarchy of per-thread heaps and shared global heaps/pools.
The Heap* APIs are called internally by malloc/new.
Here's a nice article on some dynamic memory management issues, with some even nicer references on how to instrument and analyze heap activity.
Others have covered C/C++ so I'll just add a little information on .NET.
In .NET heap allocation is generally really fast, as it is just a matter of grabbing the memory in the generation zero part of the heap. Obviously this cannot go on forever, which is where garbage collection comes in. Garbage collection may affect the performance of your application significantly since user threads must be suspended during compaction of memory. The fewer full collects, the better.
There are various things you can do to affect the workload of the garbage collector in .NET. Generally, if you have a lot of memory references, the garbage collector will have to do more work. E.g., by implementing a graph using an adjacency matrix instead of references between nodes, the garbage collector will have to analyze fewer references.
Whether that is actually significant in your application or not depends on several factors and you should profile the application with actual data before turning to such optimizations.

Resources