Garbage collection and memory management in Erlang

I want to know the technical details of garbage collection (GC) and memory management in Erlang/OTP.
But I cannot find them on erlang.org or in its documentation.
I have found some articles online that talk about GC only in a very general manner, such as which garbage collection algorithm is used.

To clarify things, let's define the memory layout and then talk about how GC works.
Memory Layout
In Erlang, each thread of execution is called a process. Each process has its own memory and that memory layout consists of three parts: Process Control Block, Stack and Heap.
PCB: Process Control Block holds information like process identifier (PID), current status (running, waiting), its registered name, and other such info.
Stack: It is a downward growing memory area which holds incoming and outgoing parameters, return addresses, local variables and temporary spaces for evaluating expressions.
Heap: It is an upward-growing memory area which holds process mailbox messages and compound terms. Binaries larger than 64 bytes are NOT stored in the process's private heap; they are stored in a large Shared Heap which is accessible by all processes.
Garbage Collection
Currently Erlang uses generational garbage collection that runs independently inside each Erlang process's private heap, and reference-counting garbage collection for the global shared heap.
Private Heap GC: It is generational, so it divides the heap into two segments: the young and the old generation. There are also two strategies for collecting: Generational (Minor) and Fullsweep (Major). A generational GC collects only the young heap, whereas a fullsweep collects both the young and the old heap.
Shared Heap GC: It is reference counting. Each object in the shared heap (Refc) has a counter of the references to it held by other objects (ProcBin), which are stored inside the private heaps of Erlang processes. When an object's reference counter reaches zero, the object has become inaccessible and will be destroyed.
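To make the Refc/ProcBin relationship concrete, here is a minimal Python sketch of the counting scheme described above. It is only a toy model: the class names mirror the terms used in this answer, but the fields and the release method are assumptions made for illustration and say nothing about how the VM implements it internally.

```python
class RefcBinary:
    """A large binary living outside every process heap (toy model)."""

    def __init__(self, data: bytes):
        self.data = data
        self.refcount = 0
        self.freed = False


class ProcBin:
    """Small per-process handle that points at a RefcBinary."""

    def __init__(self, target: RefcBinary):
        self.target = target
        target.refcount += 1

    def release(self) -> None:
        # Called when the owning process's collector finds the handle dead.
        self.target.refcount -= 1
        if self.target.refcount == 0:
            self.target.freed = True  # the shared binary can now be destroyed


shared = RefcBinary(b"x" * 100)   # larger than 64 bytes, so it is shared
handle_a = ProcBin(shared)        # held by process A
handle_b = ProcBin(shared)        # held by process B
handle_a.release()
still_alive = not shared.freed    # B still references the binary
handle_b.release()                # last reference gone
```

The shared binary survives as long as any process still holds a handle, which is why a process that never collects its dead handles can keep large binaries alive.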
To get more details and performance hints, see my article, which is the source of this answer: Erlang Garbage Collection Details and Why It Matters

A reference paper for the algorithm: One Pass Real-Time Generational Mark-Sweep Garbage Collection (1995) by Joe Armstrong and Robert Virding (at CiteSeerX)
Abstract:
Traditional mark-sweep garbage collection algorithms do not allow reclamation of data until the mark phase of the algorithm has terminated. For the class of languages in which destructive operations are not allowed we can arrange that all pointers in the heap always point backwards towards "older" data. In this paper we present a simple scheme for reclaiming data for such language classes with a single pass mark-sweep collector. We also show how the simple scheme can be modified so that the collection can be done in an incremental manner (making it suitable for real-time collection). Following this we show how the collector can be modified for generational garbage collection, and finally how the scheme can be used for a language with concurrent processes.

Erlang has a few properties that make GC actually pretty easy.
1 - Every variable is immutable, so a variable can never point to a value that was created after it.
2 - Values are copied between Erlang processes, so the memory referenced in a process is almost always completely isolated.
Both of these (especially the latter) significantly limit the amount of the heap that the GC has to scan during a collection.
Erlang uses a copying GC. During a GC, the process is stopped then the live pointers are copied from the from-space to the to-space. I forget the exact percentages, but the heap will be increased if something like only 25% of the heap can be collected during a collection, and it will be decreased if 75% of the process heap can be collected. A collection is triggered when a process's heap becomes full.
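For readers who have not seen a copying collector before, the from-space/to-space step can be sketched roughly like this in Python. This is a generic Cheney-style copy over a made-up heap representation (a dict of object ids to the ids they reference), not the VM's actual code; the function name and data layout are assumptions for illustration.

```python
def copy_collect(from_space, roots):
    """Copy every object reachable from `roots` into a fresh to-space.

    `from_space` maps an object id to the list of ids it references.
    Returns (to_space, forwarding), where `forwarding` maps old ids to
    new, densely packed indexes in to_space.
    """
    to_space = []      # new heap, filled in copy order
    forwarding = {}    # old id -> index in to_space

    def copy(old_id):
        if old_id not in forwarding:
            forwarding[old_id] = len(to_space)
            to_space.append(list(from_space[old_id]))  # still holds old ids
        return forwarding[old_id]

    for root in roots:
        copy(root)

    scan = 0
    while scan < len(to_space):   # scan pointer chases the allocation end
        to_space[scan] = [copy(child) for child in to_space[scan]]
        scan += 1
    return to_space, forwarding


# Made-up heap: object 0 references 2; objects 1 and 3 are garbage.
heap = {0: [2], 1: [2, 3], 2: [], 3: []}
new_heap, moved = copy_collect(heap, roots=[0])
```

Only the live objects are ever touched, so the cost of a collection depends on how much data survives rather than on how much garbage there is.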
The only exception is when it comes to large values that are sent to another process. These will be copied into a shared space and are reference counted. When a reference to a shared object is collected, the count is decreased; when the count reaches 0, the object is freed. No attempts are made to handle fragmentation in the shared heap.
One interesting consequence of this is, for a shared object, the size of the shared object does not contribute to the calculated size of a process's heap, only the size of the reference does. That means, if you have a lot of large shared objects, your VM could run out of memory before a GC is triggered.
Most of this is taken from the talk Jesper Wilhelmsson gave at EUC2012.

I don't know your background, but apart from the paper already pointed out by jj1bdx you can also give Jesper Wilhelmsson's thesis a chance.
BTW, if you want to monitor memory usage in Erlang to compare it to e.g. C++ you can check out:
Erlang Instrument Module
Erlang OS_MON Application
Hope this helps!

Related

List Cache Behavior

OCaml From the Ground Up states that ...
At the machine level, a linked list is a pair of a head value and a pointer to the tail.
I have heard that linked lists (in imperative languages) tend to be slow because of cache misses, memory overhead and pointer chasing. I am curious if OCaml's garbage collector or memory management system avoids any of these issues, and if they do what sort of techniques or optimizations they employ internally that might be different from linked lists in other languages.
OCaml manages its own memory: it calls system-level memory allocation and deallocation primitives on its own terms (e.g. it can allocate a big chunk of heap memory at the start of the program, and manage OCaml values on it), so if the compiler and/or the runtime knows that you are allocating a list of a fixed size, it can arrange for the cells to be close to each other in memory. And since there is no pointer type in the language itself, it can move values around during garbage collection to avoid memory fragmentation, something a language like C or C++ cannot do (or can do only with great effort, to maintain the abstraction while allowing moves).
Those are general pointers about how garbage collected languages (imperative or not) may optimize memory management, but Understanding the Garbage Collector has more details about how the garbage collector actually works in OCaml.
A linked list is indeed a horrible structure to iterate over in general.
But this is mitigated a lot by the way OCaml allocates memory and how lists are created most of the time.
In OCaml the GC allocates a large block of memory as its (minor) heap and maintains a pointer to the end of the used portion. An allocation simply increases the pointer by the needed amount of memory.
Combine that with the fact that most of the time lists are constructed in a very short time. Often the list creation is the only thing allocating memory. Think of List.map for example, or List.rev. That will produce a list where the nodes of the list are contiguous in memory. So the linked list isn't jumping all over the address space but is rather contained on a small chunk. This allows caching to work far better than you would expect for a linked list. Iterating the list will actually access memory sequentially.
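The effect is easy to see in a toy model. The Python sketch below simulates a bump-pointer heap and allocates one cell per list element; the class, the two-word cell size and the upward allocation direction are simplifications I am assuming for illustration, not a description of the real OCaml runtime.

```python
class MinorHeap:
    """Toy bump-pointer heap: allocation just advances an offset."""

    def __init__(self, size_words: int):
        self.size_words = size_words
        self.next_free = 0

    def alloc(self, words: int) -> int:
        if self.next_free + words > self.size_words:
            raise MemoryError("minor heap full; a real runtime would collect here")
        addr = self.next_free
        self.next_free += words
        return addr


def build_list(heap: MinorHeap, values):
    """Allocate one two-word cell (head value, tail pointer) per value."""
    return [heap.alloc(2) for _ in values]


heap = MinorHeap(size_words=1024)
cell_addrs = build_list(heap, range(5))
gaps = [b - a for a, b in zip(cell_addrs, cell_addrs[1:])]
```

Because nothing else allocates while the list is being built, every cell sits directly next to the previous one, so walking the list touches memory in order.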
The above means that a lot of lists are much more ordered than in other languages. And a lot of the time lists are temporary and will be purely in cache. It performs a lot better than it ought to.
There is another layer to this. In OCaml the garbage collector is a generational GC. New values are created on the minor heap, which is scanned frequently. Temporary values are thus quickly reclaimed. Values that remain alive on the minor heap are copied to the major heap, which is scanned less frequently. The copy operation compacts the values, eliminating any holes caused by values that are no longer alive. This will bring list nodes closer together again if the list had gaps in it in the first place. The same thing happens when the major heap is scanned: the memory is compacted, bringing values that were allocated close in time nearer together.
While none of that guarantees that lists will be contiguous in memory, it seems to avoid a lot of the bad effects associated with linked lists in other languages. Nonetheless, you shouldn't use lists when you need to iterate over data, or worse, access the n-th node, frequently. Use an array instead. Appending is bad too unless your list is small (and will overflow the stack for large lists). Due to the latter you often build a list in reverse, adding items to the front instead of appending at the end, and then reverse the list as a final step. And that final List.rev will give you a perfectly contiguous list.

How to measure the performance of the Erlang Garbage Collector?

I have started programming in Erlang recently and there are a few things I want to understand regarding garbage collection (GC). As far as I understand, there is a generational GC for the private heap of each process and a reference counting GC for the global shared heap.
What I would like to know is if there is any way to get:
How many collection cycles have been run?
How many bytes are allocated and deallocated, on a global level or process level?
What are the private heaps, and shared heap sizes? And can we define this as a GC parameter?
How long does it take to collect garbage? The % of time needed?
Is there a way to run a program without GC?
Is there a way to get this kind of information, either with code or using some commands when I run an Erlang program?
Thanks.
To get information for a single process, you can call erlang:process_info(Pid). This will yield (as of Erlang 18.0) the following fields:
> erlang:process_info(self()).
[{current_function,{erl_eval,do_apply,6}},
{initial_call,{erlang,apply,2}},
{status,running},
{message_queue_len,0},
{messages,[]},
{links,[<0.27.0>]},
{dictionary,[]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.26.0>},
{total_heap_size,4184},
{heap_size,2586},
{stack_size,24},
{reductions,3707},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,7}]},
{suspending,[]}]
The number of collection cycles for the process is available in the field minor_gcs under the section garbage_collection.
Per Process
The current heap size for the process is available in the field heap_size from the results above (in words, 4 bytes on a 32-bit VM and 8 bytes on a 64-bit VM). The total memory consumption of the process can be obtained by calling erlang:process_info(Pid, memory) which returns for example {memory,34312} for the above process. This includes call stack, heap and internal structures.
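Since the sizes are reported in words, a quick conversion helps when comparing them with byte figures. Here is the arithmetic for the numbers shown above as a small Python helper (the function name is just something I made up for illustration):

```python
def words_to_bytes(words: int, word_size: int = 8) -> int:
    """Convert a size in machine words to bytes (8 bytes per word on a 64-bit VM)."""
    return words * word_size


heap_bytes_64 = words_to_bytes(2586)       # heap_size on a 64-bit VM
heap_bytes_32 = words_to_bytes(2586, 4)    # the same heap_size on a 32-bit VM
total_bytes_64 = words_to_bytes(4184)      # total_heap_size on a 64-bit VM
```

So a heap_size of 2586 words is 20688 bytes on a 64-bit VM and 10344 bytes on a 32-bit VM, and a total_heap_size of 4184 words is 33472 bytes on a 64-bit VM.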
Deallocations (and allocations) can be traced using erlang:trace/3. If the trace flag is garbage_collection you will receive messages of the form {trace, Pid, gc_start, Info} and {trace, Pid, gc_end, Info}. The Info field of the gc_start message contains such things as heap_size and old_heap_size.
Per System
Top level statistics of the system can be obtained by erlang:memory/0:
> erlang:memory().
[{total,15023008},
{processes,4215272},
{processes_used,4215048},
{system,10807736},
{atom,202481},
{atom_used,187597},
{binary,325816},
{code,4575293},
{ets,234816}]
Garbage collection statistics can be obtained via erlang:statistics(garbage_collection) which yields:
> statistics(garbage_collection).
{85,23961,0}
Where (as of Erlang 18.0) the first field is the total number of garbage collections performed by the VM and the second field is the total number of words reclaimed.
The heap sizes for a process are available under the fields total_heap_size (all heap fragments and stack) and heap_size (the size of the youngest heap generation) from the process info above.
They can be controlled via spawn options, specifically min_heap_size which sets the initial heap size for a process.
To set it for all processes, erlang:system_flag(min_heap_size, MinHeapSize) can be called.
You can also control global VM memory allocation via the +M... options to the Erlang VM. The flags are described here. However, this requires extensive knowledge about the internals of the Erlang VM and its allocators and using them should not be taken lightly.
The time spent collecting can be obtained via the garbage_collection tracing described above (in the answer to the second question). If you use the option timestamp when tracing, you will receive a timestamp with each trace message that can be used to calculate the total GC time.
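Once the trace messages are collected, adding up the GC time is just a matter of pairing each gc_start with the following gc_end. A rough Python sketch of that calculation, assuming the messages for one process have already been exported as (tag, timestamp) pairs with timestamps in the {MegaSecs, Secs, MicroSecs} form; the event list below is made-up data and both function names are hypothetical:

```python
def to_microseconds(ts):
    """Convert a (MegaSecs, Secs, MicroSecs) timestamp to microseconds."""
    mega, secs, micro = ts
    return (mega * 1_000_000 + secs) * 1_000_000 + micro


def total_gc_time_us(events):
    """Sum gc_end - gc_start over one process's trace events, in order."""
    total = 0
    started_at = None
    for tag, ts in events:
        if tag == "gc_start":
            started_at = to_microseconds(ts)
        elif tag == "gc_end" and started_at is not None:
            total += to_microseconds(ts) - started_at
            started_at = None
    return total


# Made-up trace data, for illustration only.
events = [
    ("gc_start", (1435, 100000, 500)),
    ("gc_end",   (1435, 100000, 620)),
    ("gc_start", (1435, 100001, 10)),
    ("gc_end",   (1435, 100001, 95)),
]
gc_us = total_gc_time_us(events)
```

Dividing that total by the wall-clock duration of the traced run gives the percentage of time spent in GC for that process.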
Short answer: no.
Long answer: Maybe. You can control the initial heap size (via min_heap_size) which will affect when garbage collection will occur the first time. You can also control when a full sweep will be performed with the fullsweep_after option.
More information can be found in the Academic and Historical Questions and Processes section of the Efficiency Guide.
The most practical way of introspecting Erlang memory usage at runtime is via the Recon library, as Steve Vinoski mentioned.

How to release memory allocated by gcnew?

After some tests with the help of Task Manager, I understood one thing about gcnew: memory allocated for local variables remains allocated even after control leaves the function, and is re-allocated only when control re-enters the function. So I'm perplexed about how to deallocate the memory myself. Here is an example of the problem:
void Foo(void)
{
    System::Text::StringBuilder ^ t = gcnew System::Text::StringBuilder("");
    int i = 0;
    while(++i < 20000000) t->Append(i);
    return;
}
As I mentioned, memory for variable t remains after leaving Foo(), delete does not work the way it does for new, and calling Foo() once just leaves me with pointlessly allocated memory.
This is gcnew, which means garbage-collected allocation. It will be disposed of and deallocated by the GC thread.
Your function uses memory for code and data. The code is a fixed amount and will be used the entire time the library or program is loaded. The data is only used when the function is executing.
Data used by a program is either static or dynamic. Static data is laid out by the compiler and is basically equivalent to code (except that it might be marked as non-executable and/or read-only to prevent accidents). Dynamic data is temporary and allocated from a stack or heap (or CPU registers).
In a classic program, the stack and heap share the same memory address range with the stack at one end, growing toward the heap and the heap at the other end, trying not to grow into the stack. However, with modern address ranges on the order of 1TB, a heap generally has a lot of room.
Keep in mind that when a program requests an address range, it's just signaling to the operating system that it's okay to use that address for data reading, data writing and/or code execution. Until it actually puts something there, there is no load on the system. Also keep in mind with a virtual memory system, process memory is effectively allocated on the swap file/device (hard drive) with optimizations especially using RAM for caching, copy on write and many other techniques. (Data written to a memory address might never make it to the swap file, but that's up to the operating system.)
The data your function needs is for the two variables: t and i. t is a reference to a garbage collected object. i is an integer. Both are quite small and short-lived. You could think of them as being on the stack. When the function returns, the stack frame is popped and their memory is reused by the next stack operation. If you are looking at memory allocation, there won't be a change because the amount of memory allocated to the stack would not be changed.
Now in the execution of your function, a new object is created and, the way it's filled with data, it takes up quite a bit of memory. You could consider that object to be created in the heap. You don't need to delete it since it is a garbage collection object. When the garbage collector runs by walking all objects reachable from a set of root objects, it will find that the object is not reachable and add its space to a free list. When space for a new object is needed that doesn't fit into any blocks on the free list, more of the heap's address range will be used.
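The "walk from the roots, then reclaim whatever was not reached" idea can be sketched in a few lines of Python. This is a generic mark-and-sweep toy over a hand-written object graph, not the CLR's collector; the object names and the graph itself are assumptions made up for illustration.

```python
def mark(roots, edges):
    """Return the set of objects reachable from `roots`."""
    reachable = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in reachable:
            continue
        reachable.add(obj)
        stack.extend(edges.get(obj, ()))
    return reachable


def sweep(all_objects, reachable):
    """Everything that was not marked goes onto the free list."""
    return sorted(obj for obj in all_objects if obj not in reachable)


# Made-up graph after Foo() has returned: nothing refers to the
# StringBuilder any more, so it cannot be reached from the roots.
edges = {"app": ["config"], "string_builder": ["char_buffer"]}
objects = {"app", "config", "string_builder", "char_buffer"}
live = mark(roots=["app"], edges=edges)
free_list = sweep(objects, live)
```

Note that the StringBuilder's space only shows up on the free list after a collection has actually run; until then it still counts as allocated.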
The CLR heap is compactable, which means it can move objects around in order to coalesce free blocks. Using this ability, it can move objects out of areas of allocated memory and give it back to the operating system, thereby freeing up space in the swap file.
So, there are three things that have to happen for you to see a reduction in the amount of memory allocated to the process:
The garbage collection has run to find unreachable objects.
The heap has been compacted.
The heap allocation has been reduced.
None of these things are really necessary until the swap file can't grow anymore. Obviously, the system has been designed for performance and to be a good citizen so it wouldn't take it that far. You can influence when garbage collection runs but this is only very rarely helpful and is generally not done.

What is the main performance gain from garbage collection?

The LLVM documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine the typical data structure, an array of pointers to objects. Which is slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of an object. Allocated objects will be scatter-shot in the address space doing so. Long lived objects pointed-to by the array separated by the worker objects, like strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
Very big win, it is the one feature that made garbage collection competitive with explicit memory management. With the explicit kind having the problem of not supporting moving objects because it will invalidate a pointer.
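As a rough illustration of what compaction does to that array of pointers, here is a Python toy in which the heap is a list of slots and None marks dead temporaries. It is a simplified sliding compaction under assumed data, not any particular runtime's algorithm.

```python
def compact(heap):
    """Slide live objects to the start of the heap.

    `heap` is a list of slots; None marks a dead or free slot.
    Returns (new_heap, forwarding), where forwarding maps old index -> new index.
    """
    new_heap = []
    forwarding = {}
    for old_index, obj in enumerate(heap):
        if obj is not None:
            forwarding[old_index] = len(new_heap)
            new_heap.append(obj)
    return new_heap, forwarding


# Long-lived records interleaved with dead temporary strings (None).
heap = ["rec0", None, None, "rec1", None, "rec2"]
array_of_pointers = [0, 3, 5]
new_heap, forwarding = compact(heap)
array_of_pointers = [forwarding[p] for p in array_of_pointers]
```

After the pass the three records are adjacent and every pointer has been rewritten, which is exactly the step a manual allocator cannot take because it does not know where all the pointers are.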
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
I doubt locality helps performance at all. Admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well). Over time, the small objects that remain will be compacted into a closely related area of the heap, and it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
Not having to de-allocate memory is a short-term performance benefit, as objects do not need to be de-allocated individually. However, when the garbage collector does kick in, this benefit disappears. Usually though, the collection occurs when nothing else is happening in the system (theoretically) so the cost is effectively nullified.
Compaction of the heap also helps allocation, all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free space block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (which mean you always allocate from a heap for the size of block you want, and you always allocate a fixed block, so walking the heap is just to find the first free block, and this can be removed using a bitmap)
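For what it's worth, the fixed-block heap with a bitmap can be sketched like this in Python. The class and method names are made up for illustration, and a real allocator would keep one such pool per block size; this only shows how the bitmap replaces the walk for a free block.

```python
class FixedBlockPool:
    """Pool of equally sized blocks; one bit per block records whether it is in use."""

    def __init__(self, block_count: int):
        self.block_count = block_count
        self.used = 0  # bit i set means block i is allocated

    def alloc(self) -> int:
        free_mask = ~self.used & ((1 << self.block_count) - 1)
        if free_mask == 0:
            raise MemoryError("pool exhausted")
        lowest = free_mask & -free_mask      # isolate the lowest free bit
        self.used |= lowest
        return lowest.bit_length() - 1       # index of the block handed out

    def free(self, index: int) -> None:
        self.used &= ~(1 << index)


pool = FixedBlockPool(block_count=8)
a, b, c = pool.alloc(), pool.alloc(), pool.alloc()
pool.free(b)
reused = pool.alloc()   # the freed block is handed out again
```

Finding a free block is a couple of bit operations, and freed blocks are reused immediately, so there is no fragmentation within a pool.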
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when the system memory is getting filled because the user has done something like load a new page that required a lot of memory allocations.... which in turn required a collection.
It also has a tendency to use a lot of memory - 'memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look to StringBuilder classes for the evidence that this is well known. Strings may be 'solved' using this, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM IO - all that memory has to be brought into the CPU caches to be used, the more memory you use, the more IO your CPU MM will have to do and that can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle Finalised objects too, this isn't quite as bad as it used to be, but it can still halt your program while the finalisers are run.
Old Java GCs were dreadful for perf. A lot of research has made them significantly better, but they are still not perfect.
EDIT:
one more thing about localisation, imagine creating an array and adding a few items, then do a load of allocations, then you want to add another item to the array - with a GC system the added array element will not be localised, even after a compaction, each object in the array will be stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now, compare that to an array that is allocated with a buffer and objects are allocated within the buffer space. That may require a re-alloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty what object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count would have to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.
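To show where the coordination cost of reference counting comes from, here is a small Python sketch in which every count adjustment has to take a lock. The class and the worker function are hypothetical, and a real runtime would more likely use atomic instructions than a mutex, but the point is the same: every copy or drop of a reference becomes a synchronized operation.

```python
import threading


class SharedCounted:
    """Object whose reference count is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refcount = 1          # the creator holds the first reference
        self.destroyed = False

    def acquire(self) -> None:
        with self._lock:           # coordination on every reference copy
            self.refcount += 1

    def release(self) -> None:
        with self._lock:           # and again on every reference drop
            self.refcount -= 1
            if self.refcount == 0:
                self.destroyed = True


def worker(obj: SharedCounted, rounds: int) -> None:
    for _ in range(rounds):
        obj.acquire()
        obj.release()


obj = SharedCounted()
threads = [threading.Thread(target=worker, args=(obj, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
count_after_workers = obj.refcount
obj.release()   # drop the original reference
```

Without the lock, two threads could read and write the counter at the same time and the object could be destroyed too early or never at all.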

What's the difference between memory allocation and garbage collection, please?

I understand that 'Garbage Collection' is a form of memory management and that it's a way to automatically reclaim unused memory.
But what is 'memory allocation' and the conceptual difference from 'Garbage Collection'?
They are polar opposites. So yeah, pretty big difference.
Allocating memory is the process of claiming a memory space to store things.
Garbage Collection (or freeing of memory) is the process of releasing that memory back to the pool of available memory.
Many newer languages perform both of these steps in the background for you when variables are declared/initialized, and fall out of scope.
Memory allocation is the act of asking the system for some memory to use for something.
Garbage collection is a process to check if some memory that was previously allocated is no longer really in use (i.e. is no longer accessible from the program) to free it automatically.
A subtle point is that the objective of garbage collection is not actually "freeing objects that are no longer used", but to emulate a machine with infinite memory, allowing you to continue to allocate memory without caring about deallocating it; for this reason, it's not a substitute for the management of other kinds of resources (e.g. file handles, database connections, ...).
A simple pseudo-code example:
void myFoo()
{
    LinkedList<int> myList = new LinkedList<int>();
    return;
}
This will request enough new space on the heap to store the LinkedList object.
However, when the function body is over, myList disappears and you no longer have any way of knowing where this LinkedList is stored (the memory address). Hence, there is absolutely no way to tell the system to free that memory and make it available to you again later.
The Java Garbage Collector will do that for you automatically, at the cost of some performance, and it also introduces a little non-determinism (you cannot really tell when the GC will be called).
In C++ there is no native garbage collector (yet?). The correct way of managing memory there is to use smart pointers (e.g. std::shared_ptr or std::unique_ptr; std::auto_ptr is deprecated as of C++11).
You want a book. You go to the library and request the book you want. The library checks to see if they have the book (in this case they do) and you gladly take it and know you must return it later.
You go home, sit down, read the book and finish it. You return the book back to the library the next day because you are finished with it.
That is a simple analogy for memory allocation and garbage collection. Computers have limited memory, just like libraries have limited copies of books. When you want to allocate memory you need to make a request and if the computer has sufficient memory (the library has enough copies for you) then what you receive is a chunk of memory. Computers need memory for storing data.
Since computers have limited memory, you need to return the memory otherwise you will run out (just like if no one returned the books to the library then the library would have nothing, the computer will explode and burn furiously before your very eyes if it runs out of memory... not really). Garbage collection is the term for checking whether memory that has been previously allocated is no longer in use so it can be returned and reused for other purposes.
Memory allocation asks the computer for some memory, in order to store data. For example, in C++:
int* myInts = new int[howManyIntsIWant];
tells the computer to allocate me enough memory to store some number of integers.
Another way of doing the same thing would be:
int myInts[6];
The difference here is that in the second example, we know when the code is written and compiled exactly how much space we need - it's 6 * the size of one int. This lets us do static memory allocation (for a local array like this one, that means memory on what's called the "stack"; C++ formally calls this automatic storage).
In the first example we don't know how much space is needed when the code is compiled, we only know it when the program is running and we have the value of howManyIntsIWant. This is dynamic memory allocation, which gets memory on the "heap".
Now, with static allocation we don't need to tell the computer when we're finished with the memory. This relates to how the stack works; the short version is that once we've left the function where we created that static array, the memory is reclaimed straight away.
With dynamic allocation, this doesn't happen so the memory has to be cleaned up some other way. In some languages, you have to write the code to deallocate this memory; in others it's done automatically. This is garbage collection - some automatic process built into the language that will sweep through all of the dynamically allocated memory on the heap, work out which bits aren't being used and deallocate them (i.e. free them up for reuse).
So: memory allocation = asking for memory for your program. Garbage collection = where the programming language itself works out what memory isn't being used any more and deallocates it for you.
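If you want to watch both halves happen, Python makes it easy to observe. The sketch below allocates two objects that refer to each other, drops every name for them, and then asks the collector to run. It assumes CPython's gc and weakref modules; the Node class and make_cycle function are made up for illustration.

```python
import gc
import weakref


class Node:
    """Small object that can take part in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()        # memory allocation happens here
    a.other, b.other = b, a      # a and b keep each other alive
    return weakref.ref(a)        # a weak reference does not keep `a` alive


gc.disable()                     # make the collection point explicit
probe = make_cycle()
alive_before = probe() is not None   # unreachable, but not yet reclaimed
gc.collect()                     # garbage collection finds and frees the cycle
alive_after = probe() is not None
gc.enable()
```

Between the function returning and the collection running, the objects are garbage that has not been collected yet, which is exactly the gap between "allocated" and "in use" that a garbage collector exists to close.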

Resources