HeapCreate vs GetProcessHeap - winapi

I'm new to using Heap Allocation in C++.
I'm trying to understand the scenarios that would force someone to create a private heap instead of using the Process Heap. Isn't the Process Heap generally enough for most cases?
Thanks
--Ashish

If you have a flurry of transient heap activity, using a private heap for it can be faster than churning on the process heap. If you start a thread and give it its own private heap, its heap operations can be thread-safe without any need to deal with locking for them. There are other reasons, but these two are relatively common ones.

It is an easy way of using a memory pool, which is especially useful on deallocation: instead of tracking the lifetime of many small objects and deleting them one after another, create a separate heap for them and destroy the entire heap when you're done.
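A minimal sketch of that pattern using the Win32 heap functions (HeapCreate, HeapAlloc, HeapDestroy); the Node type and the sizes are just placeholders for illustration:

    #include <windows.h>
    #include <cstdio>

    struct Node {            // placeholder for the many small, short-lived objects
        int   value;
        Node* next;
    };

    int main() {
        // Create a growable private heap: 1 MiB initial reserve, no maximum size.
        HANDLE pool = HeapCreate(0, 1 << 20, 0);
        if (!pool) return 1;

        // Allocate many small objects from the private heap.
        Node* head = nullptr;
        for (int i = 0; i < 10000; ++i) {
            Node* n = static_cast<Node*>(HeapAlloc(pool, 0, sizeof(Node)));
            if (!n) break;
            n->value = i;
            n->next  = head;
            head     = n;
        }

        // ... use the objects ...

        // No per-object HeapFree calls: destroying the heap releases everything at once.
        HeapDestroy(pool);
        std::printf("pool destroyed\n");
        return 0;
    }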

Related

What is the purpose of creating a private heap

On Windows, one can use the HeapCreate function to allocate a private heap for the calling process, even though each process already has its own default heap.
My question is: what are some possible reasons a programmer would use a private heap rather than the default one? In other words, in what scenarios does a private heap become really handy?
There is a whole long list of reasons for creating multiple heaps:
Better efficiency with threads (threads do not share heaps).
Debugging and error trapping
Setting up heaps with different allocation properties.
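As a rough illustration of the last point, here is a sketch of a heap created with a hard maximum size and HEAP_GENERATE_EXCEPTIONS, so failed allocations raise an exception instead of returning NULL; the sizes are arbitrary:

    #include <windows.h>
    #include <cstdio>

    int main() {
        // A private heap with non-default properties (illustrative values):
        //  - HEAP_GENERATE_EXCEPTIONS: failed allocations raise an SEH exception
        //    instead of returning NULL, which can help with error trapping.
        //  - 64 KiB initial size, 256 KiB hard maximum: the heap cannot grow past it.
        HANDLE heap = HeapCreate(HEAP_GENERATE_EXCEPTIONS, 64 * 1024, 256 * 1024);
        if (!heap) return 1;

        void* p = HeapAlloc(heap, HEAP_ZERO_MEMORY, 1024);  // zero-initialized block
        std::printf("allocated %zu bytes at %p\n", HeapSize(heap, 0, p), p);

        HeapFree(heap, 0, p);
        HeapDestroy(heap);
        return 0;
    }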

Why does Windows allow creating private heaps?

I am learning memory management in Windows. I know that a process in Windows has a default heap, which can be extended later. A process can also create additional (private) heaps. Why does Windows allow creating private heaps? What is the benefit of this approach? As I understand it, using the default heap (with possible reallocations) is enough. Or is it another way to optimize reallocations?
If you look at HeapCreate you will see that it has multiple options that change how the heap works. HEAP_NO_SERIALIZE will make it faster, but you have to handle thread synchronization on your own, etc.
Having multiple heaps can also be beneficial if you allocate objects of different sizes with different lifetimes. You might want to put large long-living objects on their own heap if you also have a high churn of small objects that are allocated and de-allocated as part of your work, to reduce fragmentation (and lock contention if you are multithreaded).
As noted in a comment, you can call HeapDestroy to free every allocation and the heap itself in one call but this only makes sense if you have full control over everything allocated there. You are not allowed to destroy the default heap so you must create your own private heap to use this trick.
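A hedged sketch of the per-thread idea: each worker creates its own heap with HEAP_NO_SERIALIZE so the heap can skip its internal lock, which is safe only because no other thread ever touches that heap. The worker function, sizes and iteration counts are made up for illustration:

    #include <windows.h>
    #include <thread>
    #include <vector>
    #include <cstdio>

    // Each worker owns a private, unsynchronized heap. Because only this thread
    // ever allocates from it, skipping the heap's internal lock is safe.
    void worker(int id) {
        HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
        if (!heap) return;

        for (int i = 0; i < 100000; ++i) {
            void* p = HeapAlloc(heap, HEAP_NO_SERIALIZE, 64);
            // ... transient work with p ...
            HeapFree(heap, HEAP_NO_SERIALIZE, p);
        }

        HeapDestroy(heap);
        std::printf("worker %d done\n", id);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) threads.emplace_back(worker, i);
        for (auto& t : threads) t.join();
        return 0;
    }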

What is the main performance gain from garbage collection?

The llvm documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine the typical data structure, an array of pointers to objects. Which is slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of an object. Allocated objects will be scatter-shot in the address space doing so. Long lived objects pointed-to by the array separated by the worker objects, like strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
A very big win, and it is the one feature that made garbage collection competitive with explicit memory management, since the explicit kind cannot support moving objects because moving them would invalidate pointers.
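A small C++ illustration of that locality argument; the Record type and the sums are made up, and the second layout is roughly what a compacting collector turns the first one into:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Record { std::int64_t field; };  // placeholder object type

    // Layout 1: array of pointers to individually heap-allocated objects.
    // After lots of interleaved allocation the objects are scattered, so walking
    // the array chases pointers all over the address space (a cache miss per hop).
    std::int64_t sum_scattered(const std::vector<Record*>& recs) {
        std::int64_t total = 0;
        for (const Record* r : recs) total += r->field;
        return total;
    }

    // Layout 2: objects stored contiguously -- roughly what a compacting collector
    // turns layout 1 into. Element N+1 sits next to element N, so the caches win.
    std::int64_t sum_contiguous(const std::vector<Record>& recs) {
        std::int64_t total = 0;
        for (const Record& r : recs) total += r.field;
        return total;
    }

    int main() {
        std::vector<Record*> scattered;
        std::vector<Record>  packed;
        for (int i = 0; i < 1000; ++i) {
            scattered.push_back(new Record{i});
            packed.push_back(Record{i});
        }
        std::printf("%lld %lld\n",
                    (long long)sum_scattered(scattered),
                    (long long)sum_contiguous(packed));
        for (Record* r : scattered) delete r;
        return 0;
    }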
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
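A rough sketch of why allocation is cheap when free space is kept contiguous; this is not the JVM's actual allocator, just a bump-pointer nursery in the same spirit:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Bump-pointer nursery: the free space is one contiguous run, so allocating is
    // just "bump a pointer and check for overflow" -- no free chain to traverse.
    class Nursery {
    public:
        explicit Nursery(std::size_t bytes) : buffer_(bytes), top_(0) {}

        void* allocate(std::size_t size) {
            size = (size + 7) & ~std::size_t{7};              // 8-byte alignment
            if (top_ + size > buffer_.size()) return nullptr; // full: time to collect
            void* p = buffer_.data() + top_;
            top_ += size;
            return p;
        }

        // After the few survivors are copied out, the whole region becomes
        // contiguous free space again in a single, cheap step.
        void reset() { top_ = 0; }

    private:
        std::vector<std::uint8_t> buffer_;
        std::size_t top_;
    };

    int main() {
        Nursery nursery(1 << 16);
        void* a = nursery.allocate(24);   // cheap: pointer bump
        void* b = nursery.allocate(40);   // cheap: pointer bump
        (void)a; (void)b;
        nursery.reset();                  // "collect": everything reclaimed at once
        return 0;
    }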
I doubt locality helps performance at all. Admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well); over time, the small objects that remain will be compacted into a closely related area of the heap, and it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
Deferred de-allocation of memory is a short-term performance benefit, since objects do not need to be freed as you go. However, when the garbage collector does kick in, this benefit disappears. Usually, though, collection occurs when nothing else is happening in the system (theoretically), so the cost is effectively nullified.
Compaction of the heap also helps allocation: all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (meaning you always allocate from the heap holding blocks of the size you want, and you always allocate a fixed block, so walking the heap is just finding the first free block, and even that can be removed by using a bitmap).
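A rough C++ sketch of the fixed-block idea; the free-list representation and sizes are my own choices, and the bitmap variant mentioned above would replace the free list with one bit per block:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Fixed-block pool: every allocation is the same size, so "finding space"
    // is just popping the head of a free list -- no walking, no size matching.
    class FixedBlockPool {
    public:
        FixedBlockPool(std::size_t block_size, std::size_t block_count)
            : block_size_(block_size < sizeof(void*) ? sizeof(void*) : block_size),
              storage_(block_size_ * block_count),
              free_head_(nullptr) {
            // Thread every block onto the free list.
            for (std::size_t i = 0; i < block_count; ++i) {
                void* block = storage_.data() + i * block_size_;
                *static_cast<void**>(block) = free_head_;
                free_head_ = block;
            }
        }

        void* allocate() {
            if (!free_head_) return nullptr;           // pool exhausted
            void* block = free_head_;
            free_head_ = *static_cast<void**>(block);  // pop
            return block;
        }

        void deallocate(void* block) {
            *static_cast<void**>(block) = free_head_;  // push
            free_head_ = block;
        }

    private:
        std::size_t               block_size_;
        std::vector<std::uint8_t> storage_;
        void*                     free_head_;
    };

    int main() {
        FixedBlockPool pool(64, 1024);   // 1024 blocks of 64 bytes each
        void* a = pool.allocate();
        void* b = pool.allocate();
        pool.deallocate(a);
        pool.deallocate(b);
        return 0;
    }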
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when system memory is getting full because the user has done something like load a new page that required a lot of memory allocations... which in turn required a collection.
It also has a tendency to use a lot of memory. 'Memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look at the StringBuilder classes for evidence that this is well known. Strings may be 'solved' this way, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM I/O: all that memory has to be brought into the CPU caches to be used, so the more memory you use, the more I/O your CPU's memory manager will have to do, and that can kill performance in the wrong circumstances.
In addition, when a GC occurs you have to handle finalised objects too. This isn't quite as bad as it used to be, but it can still halt your program while the finalisers run.
Old Java GCs were dreadful for performance; a lot of research has made them significantly better, but they are still not perfect.
EDIT:
One more thing about localisation: imagine creating an array and adding a few items, then doing a load of allocations, then wanting to add another item to the array. With a GC system the added array element will not be localised; even after a compaction, each object in the array is stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now compare that to an array that is allocated with a buffer, with the objects allocated within the buffer space. That may require a re-alloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty which object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count has to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
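A hedged C++ sketch of the mechanics being described, using an atomic counter as the multi-core-safe variant; the type and function names are made up for illustration:

    #include <atomic>
    #include <cstdio>

    // Reference-counted object. On a machine where several threads can touch the
    // count concurrently, every increment/decrement must be an atomic
    // read-modify-write, which is much more expensive than plain integer arithmetic.
    struct RefCounted {
        std::atomic<int> refs{1};
        int payload = 0;
    };

    void acquire(RefCounted* obj) {
        obj->refs.fetch_add(1, std::memory_order_relaxed);  // before copying a reference
    }

    void release(RefCounted* obj) {
        // Before destroying a reference: if ours was the last one, free the object.
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete obj;
        }
    }

    int main() {
        RefCounted* obj = new RefCounted{};    // refs == 1
        acquire(obj);                          // copy a reference:    refs == 2
        release(obj);                          // drop the copy:       refs == 1
        release(obj);                          // last reference gone: object destroyed
        std::printf("done\n");
        return 0;
    }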
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.

Garbage collection and memory management in Erlang

I want to know technical details about garbage collection (GC) and memory management in Erlang/OTP.
But, I cannot find on erlang.org and its documents.
I have found some articles online which talk about GC in a very general manner, such as what garbage collection algorithm is used.
To classify things, let's define the memory layout and then talk about how GC works.
Memory Layout
In Erlang, each thread of execution is called a process. Each process has its own memory and that memory layout consists of three parts: Process Control Block, Stack and Heap.
PCB: Process Control Block holds information like process identifier (PID), current status (running, waiting), its registered name, and other such info.
Stack: It is a downward growing memory area which holds incoming and outgoing parameters, return addresses, local variables and temporary spaces for evaluating expressions.
Heap: It is an upward growing memory area which holds process mailbox messages and compound terms. Binary terms which are larger than 64 bytes are NOT stored in process private heap. They are stored in a large Shared Heap which is accessible by all processes.
Garbage Collection
Currently Erlang uses generational garbage collection that runs inside each Erlang process's private heap independently, and reference-counting garbage collection for the global shared heap.
Private Heap GC: It is generational, so it divides the heap into two segments: young and old generations. There are also two strategies for collecting: Generational (Minor) and Fullsweep (Major). The generational GC just collects the young heap, but fullsweep collects both the young and old heaps.
Shared Heap GC: It is reference counting. Each object in the shared heap (Refc) has a counter of references to it held by other objects (ProcBin) which are stored inside the private heaps of Erlang processes. If an object's reference counter reaches zero, the object has become inaccessible and will be destroyed.
To get more details and performance hints, just look at my article which is the source of the answer: Erlang Garbage Collection Details and Why It Matters
A reference paper for the algorithm: One Pass Real-Time Generational Mark-Sweep Garbage Collection by Joe Armstrong and Robert Virding (1995, at CiteSeerX)
Abstract:
Traditional mark-sweep garbage collection algorithms do not allow reclamation of data until the mark phase of the algorithm has terminated. For the class of languages in which destructive operations are not allowed we can arrange that all pointers in the heap always point backwards towards "older" data. In this paper we present a simple scheme for reclaiming data for such language classes with a single pass mark-sweep collector. We also show how the simple scheme can be modified so that the collection can be done in an incremental manner (making it suitable for real-time collection). Following this we show how the collector can be modified for generational garbage collection, and finally how the scheme can be used for a language with concurrent processes.
Erlang has a few properties that make GC actually pretty easy.
1 - Every variable is immutable, so a variable can never point to a value that was created after it.
2 - Values are copied between Erlang processes, so the memory referenced in a process is almost always completely isolated.
Both of these (especially the latter) significantly limit the amount of the heap that the GC has to scan during a collection.
Erlang uses a copying GC. During a GC, the process is stopped, then the live data is copied from the from-space to the to-space. I forget the exact percentages, but the heap will be increased if something like only 25% of the heap can be collected during a collection, and it will be decreased if 75% of the process heap can be collected. A collection is triggered when a process's heap becomes full.
The only exception is when it comes to large values that are sent to another process. These are copied into a shared space and are reference counted. When a reference to a shared object is collected the count is decreased; when that count reaches 0 the object is freed. No attempt is made to handle fragmentation in the shared heap.
One interesting consequence of this is that, for a shared object, the size of the shared object does not contribute to the calculated size of a process's heap; only the size of the reference does. That means that if you have a lot of large shared objects, your VM can run out of memory before a GC is triggered.
Most of this is taken from the talk Jesper Wilhelmsson gave at EUC2012.
I don't know your background, but apart from the paper already pointed out by jj1bdx, you can also give Jesper Wilhelmsson's thesis a chance.
BTW, if you want to monitor memory usage in Erlang to compare it to e.g. C++ you can check out:
Erlang Instrument Module
Erlang OS_MON Application
Hope this helps!

Private heap or manage memory myself

I know we can gain some advantages from creating a private Windows heap, especially for frequently allocated and de-allocated small chunks. But I think the normal approach is to allocate a large block of memory from the default heap and manage the allocations and de-allocations ourselves. My question is: what are the advantages and disadvantages of each of these two approaches?
Thanks,
Max
Some advantages of managing your own heap:
You might be able to optimize very specifically for your own allocation needs and improve performance.
You may be able to avoid the use of synchronization objects if you know the concurrency rules.
A single free can release an entire set of allocations. For example, a short lived process that needs a bunch of small allocations that are freed all at once could carve them out of a larger block, which can be freed with a single call later.
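A rough sketch of that last advantage, assuming a simple arena carved out of one block obtained from the process default heap; the names and sizes are illustrative, not a production allocator:

    #include <windows.h>
    #include <cstddef>
    #include <cstdint>
    #include <new>

    // Arena: one large block taken from the process default heap, many small
    // allocations carved out of it, and a single HeapFree at the end releases
    // all of them at once.
    class Arena {
    public:
        explicit Arena(std::size_t bytes)
            : base_(static_cast<std::uint8_t*>(HeapAlloc(GetProcessHeap(), 0, bytes))),
              size_(bytes), used_(0) {}

        ~Arena() { if (base_) HeapFree(GetProcessHeap(), 0, base_); }

        void* allocate(std::size_t size) {
            // Keep everything aligned for any object type (illustrative choice).
            size = (size + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
            if (!base_ || used_ + size > size_) return nullptr;
            void* p = base_ + used_;
            used_ += size;
            return p;
        }

    private:
        std::uint8_t* base_;
        std::size_t   size_;
        std::size_t   used_;
    };

    int main() {
        Arena scratch(64 * 1024);                    // one block from the default heap
        int* a = new (scratch.allocate(sizeof(int))) int(1);
        int* b = new (scratch.allocate(sizeof(int))) int(2);
        (void)a; (void)b;
        return 0;   // Arena's destructor releases every carved allocation with one free
    }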
The disadvantages, though, are very big. The added complexity will produce more bugs, more difficult maintenance, and quite possibly poorer performance in the end. I have absolutely no data to support this, but I suspect that most home-grown heap management systems hurt performance more than they help it.
Some advantages of using the system's allocations (e.g., HeapAlloc):
Less complexity.
Reduced risk of concurrency problems in the allocation/freeing.
The ability to take advantage of the Low-Fragmentation Heap. This already does a very good job in most cases of handling the small allocations very efficiently.
Allocating larger chunks is commonly done in pool allocators, where the overhead of allocation and deallocation is reduced and locality is increased (as memory is more likely to be consecutive). More information on pool allocators
In many cases, fragmentation is your worst enemy. When you are allocating, keep object sizes consistent or rounded into a few size classes (powers of two are popular, but may be too wasteful). This reduces fragmentation, as there are only a few common sizes of memory being allocated.
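A tiny sketch of the size-class idea: round each request up to the next power of two so only a handful of distinct block sizes ever exist. The helper name and the minimum class size are my own choices:

    #include <cstddef>
    #include <cstdio>

    // Round a requested size up to the next power of two, so allocations fall
    // into a small, fixed set of size classes. Fewer distinct sizes means freed
    // blocks are far more likely to be reusable by later requests.
    std::size_t round_up_pow2(std::size_t n) {
        std::size_t size = 16;            // smallest size class (arbitrary choice)
        while (size < n) size <<= 1;
        return size;
    }

    int main() {
        std::printf("%zu\n", round_up_pow2(24));   // 32
        std::printf("%zu\n", round_up_pow2(100));  // 128
        std::printf("%zu\n", round_up_pow2(16));   // 16
        return 0;
    }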
