As far as I know, Node.js has two parameters that control memory allocation:
--max_new_space_size and --max_old_space_size
What exactly are the NEW SPACE and OLD SPACE these refer to?
In a generational garbage collector (which V8 uses), the heap is generally divided into two spaces: a young generation (new-space) and an old generation (old-space). Infant mortality, or the generational hypothesis, is the observation that, in most cases, young objects are much more likely to die than old objects.
New-space: Most objects are allocated here. New-space is small and is designed to be garbage collected very quickly, independent of other spaces.
Old-space: Contains most objects which may have pointers to other objects. Most objects are moved here after surviving in new-space for a while.
Ref: http://www.memorymanagement.org/glossary/g.html#term-generational-hypothesis
Ref: http://jayconrod.com/posts/55/a-tour-of-v8-garbage-collection
OCaml From the Ground Up states that ...
At the machine level, a linked list is a pair of a head value and a pointer to the tail.
I have heard that linked lists (in imperative languages) tend to be slow because of cache misses, memory overhead and pointer chasing. I am curious if OCaml's garbage collector or memory management system avoids any of these issues, and if they do what sort of techniques or optimizations they employ internally that might be different from linked lists in other languages.
OCaml manages its own memory: it calls system-level memory allocation and deallocation primitives on its own terms (e.g. it can allocate a big chunk of heap memory at the start of the program and manage OCaml values on it), so if the compiler and/or the runtime knows that you are allocating a list of a fixed size, it can arrange for the cells to be close to each other in memory. And since there is no pointer type in the language itself, it can move values around during garbage collection to avoid memory fragmentation, something a language like C or C++ cannot do (or only with great effort to maintain the abstraction while allowing moves).
Those are general pointers about how garbage collected languages (imperative or not) may optimize memory management, but Understanding the Garbage Collector has more details about how the garbage collector actually works in OCaml.
A linked list is indeed a horrible structure to iterate over in general.
But this is mitigated a lot by the way OCaml allocates memory and how lists are created most of the time.
In OCaml the GC allocates a large block of memory as its (minor) heap and maintains a pointer to the end of the used portion. An allocation simply increases the pointer by the needed amount of memory.
Combine that with the fact that most of the time lists are constructed in a very short time. Often the list creation is the only thing allocating memory. Think of List.map for example, or List.rev. That will produce a list where the nodes of the list are contiguous in memory. So the linked list isn't jumping all over the address space but is rather contained on a small chunk. This allows caching to work far better than you would expect for a linked list. Iterating the list will actually access memory sequentially.
The above means that a lot of lists are much more ordered than in other languages, and a lot of the time lists are temporary and live purely in cache. Lists perform a lot better than they ought to.
There is another layer to this. In OCaml the garbage collector is a generational GC. New values are created on the minor heap, which is scanned frequently. Temporary values are thus quickly reclaimed. Values that remain alive on the minor heap are copied to the major heap, which is scanned less frequently. The copy operation compacts the values, eliminating any holes caused by values that are no longer alive. This will bring list nodes closer together again if the list had gaps in it in the first place. The same thing happens when the major heap is scanned: the memory is compacted, bringing values that were allocated close in time nearer together.
While none of that guarantees that lists will be contiguous in memory, it seems to avoid a lot of the bad effects associated with linked lists in other languages. Nonetheless, you shouldn't use lists when you need to iterate over data, or worse access the n-th node, frequently. Use an array instead. Appending is bad too unless your list is small (and will overflow the stack for large lists). Because of the latter you often build a list in reverse, adding items to the front instead of appending at the end, and then reverse the list as a final step. And that final List.rev will give you a perfectly contiguous list.
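To make the contiguity argument concrete, here is a minimal C sketch of bump-pointer allocation from a single arena (an analogy, not OCaml's actual runtime): cons cells allocated one after another end up adjacent in memory, so walking the list touches memory almost sequentially.

    #include <stdalign.h>
    #include <stddef.h>
    #include <stdio.h>

    /* A cons cell at the machine level: a head value and a pointer to the tail. */
    typedef struct cell {
        int head;
        struct cell *tail;
    } cell;

    /* A toy "minor heap": one big block and a pointer to the end of the used
       portion.  Allocation just bumps the pointer (no overflow handling here). */
    static alignas(max_align_t) unsigned char arena[1 << 20];
    static size_t arena_used = 0;

    static void *bump_alloc(size_t n) {
        void *p = arena + arena_used;
        arena_used += n;
        return p;
    }

    static cell *cons(int head, cell *tail) {
        cell *c = bump_alloc(sizeof *c);
        c->head = head;
        c->tail = tail;
        return c;
    }

    int main(void) {
        /* Build the list by consing onto the front; every cell comes from the
           same arena, so consecutive cells sit at adjacent addresses. */
        cell *list = NULL;
        for (int i = 9; i >= 0; i--)
            list = cons(i, list);

        /* Iterating the list walks the arena at a fixed, small stride. */
        for (cell *c = list; c != NULL; c = c->tail)
            printf("%d stored at arena offset %td\n",
                   c->head, (unsigned char *)c - arena);
        return 0;
    }

The printed offsets differ by exactly sizeof(cell) from one node to the next, which is the sequential access pattern described above; a list whose cells come from individual malloc calls gives no such guarantee.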
The llvm documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection itself, it is the collector's ability to rebuild the object graph and reorganize the data structure while doing so: compacting.
Imagine the typical data structure: an array of pointers to objects, slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of an object. The allocated objects end up scatter-shot across the address space, with the long-lived objects pointed to by the array separated by the short-lived worker objects, like the strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
That is a very big win; it is the one feature that made garbage collection competitive with explicit memory management, which has the problem of not supporting moving objects, because moving would invalidate pointers.
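As a rough illustration of what a moving collector buys you, here is a hypothetical C sketch of a "compaction" pass over the array-of-pointers scenario described above. A real collector discovers reachability and updates every reference in the program, but the effect on layout is the same.

    #include <stdlib.h>

    /* The scenario above: an array of pointers to records that were
       allocated all over the heap while the program ran. */
    typedef struct record {
        int field;
        /* ... more fields ... */
    } record;

    /* Toy "compaction" pass (hypothetical, not any real collector's code):
       copy every record the array still points to into one fresh contiguous
       block, in array order, and patch the pointers.  Afterwards, iterating
       records[0..n-1] reads memory sequentially. */
    static record *compact(record **records, size_t n) {
        record *to_space = malloc(n * sizeof *to_space);
        if (to_space == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++) {
            to_space[i] = *records[i];   /* move the object into to-space        */
            records[i] = &to_space[i];   /* update the reference to its new home */
        }
        return to_space;                 /* the old, scattered copies are garbage */
    }

    int main(void) {
        record *refs[3];
        for (size_t i = 0; i < 3; i++) {
            refs[i] = malloc(sizeof(record));
            refs[i]->field = (int)i;
        }
        record *to_space = compact(refs, 3);
        /* refs[0..2] now point into to_space, back to back in memory.
           (The original scattered copies play the role of garbage here.) */
        free(to_space);
        return 0;
    }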
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
I doubt locality helps performance at all. Admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well); over time, the small objects that remain will be compacted into a closely related area of the heap, and it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
Deferring the de-allocation of memory is a short-term performance benefit, since objects do not need to be de-allocated immediately. However, when the garbage collector does kick in, this benefit disappears. Usually, though, the collection occurs when nothing else is happening in the system (theoretically), so the cost is effectively nullified.
Compaction of the heap also helps allocation: all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (which means you always allocate from the heap for the size of block you want, and you always allocate a fixed block, so walking the heap is just finding the first free block, and even that can be removed using a bitmap).
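A minimal sketch of such a fixed-block heap with a free bitmap might look like this in C (hypothetical, and using a GCC/Clang builtin to find the first free bit):

    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* A fixed-block heap with a free bitmap: all blocks have the same size,
       and one bit per block records whether it is free, so allocation never
       walks a free list or searches for a hole of the right size. */
    #define BLOCK_SIZE 64
    #define NUM_BLOCKS 1024   /* must be a multiple of 64 for this sketch */

    static alignas(max_align_t) unsigned char pool[NUM_BLOCKS][BLOCK_SIZE];
    static uint64_t free_map[NUM_BLOCKS / 64];   /* bit set = block is free */

    static void pool_init(void) {
        memset(free_map, 0xFF, sizeof free_map);   /* everything starts free */
    }

    static void *pool_alloc(void) {
        for (size_t w = 0; w < NUM_BLOCKS / 64; w++) {
            if (free_map[w] == 0)
                continue;                            /* this word is fully used      */
            int bit = __builtin_ctzll(free_map[w]);  /* first free block (GCC/Clang) */
            free_map[w] &= ~(1ULL << bit);
            return pool[w * 64 + bit];
        }
        return NULL;                                 /* pool exhausted */
    }

    static void pool_free(void *p) {
        size_t idx = (size_t)(((unsigned char (*)[BLOCK_SIZE])p) - pool);
        free_map[idx / 64] |= 1ULL << (idx % 64);
    }

    int main(void) {
        pool_init();
        void *a = pool_alloc();
        void *b = pool_alloc();   /* a and b are adjacent 64-byte blocks */
        pool_free(a);
        pool_free(b);
        return 0;
    }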
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when the system memory is getting filled because the user has done something like load a new page that required a lot of memory allocations.... which in turn required a collection.
It also has a tendency to use a lot of memory. 'Memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look at the StringBuilder classes for evidence that this is well known. Strings may be 'solved' using this, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM IO: all that memory has to be brought into the CPU caches to be used, and the more memory you use, the more IO your CPU MM will have to do, which can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle finalised objects too; this isn't quite as bad as it used to be, but it can still halt your program while the finalisers run.
Old Java GCs were dreadful for performance; a lot of research has made them significantly better, but they are still not perfect.
EDIT:
One more thing about localisation: imagine creating an array and adding a few items, then doing a load of allocations, then adding another item to the array. With a GC system the added array element will not be localised; even after a compaction, each object in the array will be stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now, compare that to an array that is allocated with a buffer, with objects allocated within the buffer space. That may require a re-alloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty what object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count would have to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
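For illustration, a minimal C11 sketch of the per-reference cost being described (the names here are invented, not any library's API): on a multi-core machine each count adjustment has to be an atomic read-modify-write, which is exactly the cross-CPU coordination that makes it expensive.

    #include <stdatomic.h>
    #include <stdlib.h>

    /* A minimal refcounted object. */
    typedef struct object {
        atomic_size_t refcount;
        /* ... payload ... */
    } object;

    /* Every adjustment is an atomic read-modify-write, so the cores
       coordinate and never update the counter at the same time.  That
       coordination (a locked instruction, exclusive ownership of the cache
       line) is what makes each copy or destruction of a reference noticeably
       more expensive than the plain increment that would suffice on a
       single-core machine. */
    static void retain(object *o) {
        atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
    }

    static void release(object *o) {
        /* Acquire/release ordering so the thread that drops the last
           reference sees every write the other owners made before freeing. */
        if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1)
            free(o);
    }

    int main(void) {
        object *o = malloc(sizeof *o);
        atomic_init(&o->refcount, 1);   /* the creating thread holds one reference */
        retain(o);                      /* a second owner appears                   */
        release(o);
        release(o);                     /* the last owner frees the object          */
        return 0;
    }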
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.
I want to know technical details about garbage collection (GC) and memory management in Erlang/OTP.
But I cannot find them on erlang.org or in its documentation.
I have found some articles online which talk about GC in a very general manner, such as what garbage collection algorithm is used.
To classify things, let's define the memory layout and then talk about how GC works.
Memory Layout
In Erlang, each thread of execution is called a process. Each process has its own memory and that memory layout consists of three parts: Process Control Block, Stack and Heap.
PCB: Process Control Block holds information like process identifier (PID), current status (running, waiting), its registered name, and other such info.
Stack: It is a downward growing memory area which holds incoming and outgoing parameters, return addresses, local variables and temporary spaces for evaluating expressions.
Heap: It is an upward growing memory area which holds process mailbox messages and compound terms. Binary terms which are larger than 64 bytes are NOT stored in process private heap. They are stored in a large Shared Heap which is accessible by all processes.
Garbage Collection
Currently Erlang uses generational garbage collection, which runs inside each Erlang process's private heap independently, plus reference-counting garbage collection for the global shared heap.
Private Heap GC: It is generational, so it divides the heap into two segments: young and old generations. There are also two collection strategies: Generational (Minor) and Fullsweep (Major). A generational (minor) GC collects just the young heap, while a fullsweep (major) GC collects both the young and old heaps.
Shared Heap GC: It is reference counting. Each object in shared heap (Refc) has a counter of references to it held by other objects (ProcBin) which are stored inside private heap of Erlang processes. If an object's reference counter reaches zero, the object has become inaccessible and will be destroyed.
To get more details and performance hints, just look at my article which is the source of the answer: Erlang Garbage Collection Details and Why It Matters
A reference paper for the algorithm: One Pass Real-Time Generational Mark-Sweep Garbage Collection by Joe Armstrong and Robert Virding (1995, available at CiteSeerX)
Abstract:
Traditional mark-sweep garbage collection algorithms do not allow reclamation of data until the mark phase of the algorithm has terminated. For the class of languages in which destructive operations are not allowed we can arrange that all pointers in the heap always point backwards towards "older" data. In this paper we present a simple scheme for reclaiming data for such language classes with a single pass mark-sweep collector. We also show how the simple scheme can be modified so that the collection can be done in an incremental manner (making it suitable for real-time collection). Following this we show how the collector can be modified for generational garbage collection, and finally how the scheme can be used for a language with concurrent processes.
Erlang has a few properties that make GC actually pretty easy.
1 - Every variable is immutable, so a variable can never point to a value that was created after it.
2 - Values are copied between Erlang processes, so the memory referenced in a process is almost always completely isolated.
Both of these (especially the latter) significantly limit the amount of the heap that the GC has to scan during a collection.
Erlang uses a copying GC. During a GC, the process is stopped then the live pointers are copied from the from-space to the to-space. I forget the exact percentages, but the heap will be increased if something like only 25% of the heap can be collected during a collection, and it will be decreased if 75% of the process heap can be collected. A collection is triggered when a process's heap becomes full.
The only exception is when it comes to large values that are sent to another process. These will be copied into a shared space and are reference counted. When a reference to a shared object is collected the count is decreased, when that count is 0 the object is freed. No attempts are made to handle fragmentation in the shared heap.
One interesting consequence of this is that, for a shared object, the size of the object does not contribute to the calculated size of a process's heap; only the size of the reference does. That means that if you have a lot of large shared objects, your VM could run out of memory before a GC is triggered.
Most of this is taken from the talk Jesper Wilhelmsson gave at EUC2012.
I don't know your background, but apart from the paper already pointed out by jj1bdx you can also give Jesper Wilhelmsson's thesis a chance.
BTW, if you want to monitor memory usage in Erlang to compare it to e.g. C++ you can check out:
Erlang Instrument Module
Erlang OS_MON Application
Hope this helps!
I think that both (generational and incremental) are different approaches to make the garbage collection pauses faster. But what are the differences between generational and incremental? How do they work? And which one is better for real time software / produces less long pauses?
Also, the Boehm GC is any of those?
A generational GC is always incremental, because it does not collect all unreachable objects during a cycle. Conversely, an incremental GC does not necessarily employ a generation scheme to decide which unreachable objects to collect or not.
A generational GC divides objects into different sets, roughly according to their last use, their age, so to speak. The basic theory is that the most recently created objects are the ones most likely to become unreachable quickly, so the set of 'young' objects is collected early and often.
An incremental GC may be implemented with the above generational scheme, but different methods can be employed to decide which group of objects should be swept.
One might look at this wikipedia page and further downward, for more information on both GC methods.
According to Boehm's website, his GC is incremental and generational:
The collector uses a mark-sweep algorithm. It provides incremental and generational collection under operating systems which provide the right kind of virtual memory support.
As far as a real time environment is concerned, there are several academic research papers describing new and ingenious ways to do garbage collection:
Nonblocking Real-Time Garbage Collection
Real-time garbage collection by IBM has a good explanation of the differences.
An incremental garbage collector is any garbage-collector that can run incrementally (meaning that it can do a little work, then some more work, then some more work), instead of having to run the whole collection without interruption. This stands in contrast to old stop-the-world garbage collectors that did e.g. a mark&sweep without any other code being able to work on the objects. But to be clear: Whether an incremental garbage collector actually runs in parallel to other code executing on the same objects is not important as long as it is interruptable (for which it has to e.g. distinguish between dirty and clean objects).
A generational garbage collector differentiates between old, medium and new objects. It can then do copying GC on the new objects (keyword "Eden"), mark&sweep for the old objects and different possibilities (depending on implementation) on the medium objects. Depending on implementation the way the generations of objects are distinguished is either by region occupied in memory or by flags. The challenge of generational GC is to keep lists of objects that refer from one generation to the other up to date.
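A hypothetical C sketch of that bookkeeping, often called a write barrier feeding a remembered set: every store of a pointer into an old object is checked, and old-to-young references are recorded so a minor collection does not have to scan the whole old generation. (The names and representation here are invented for illustration; real VMs typically use card tables or similar structures.)

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct object object;
    struct object {
        bool    in_old_gen;   /* which generation this object currently lives in */
        object *field;        /* a single pointer field, for simplicity          */
    };

    #define REMEMBERED_MAX 1024
    static object *remembered_set[REMEMBERED_MAX];
    static size_t  remembered_count = 0;

    /* Every pointer store goes through this function instead of a plain
       assignment; that is the cost generational collectors pay on writes. */
    static void write_barrier(object *holder, object *value) {
        holder->field = value;
        /* Record only the interesting case: an old object now points at a
           young one.  These entries are extra roots for the next minor GC. */
        if (holder->in_old_gen && value != NULL && !value->in_old_gen
                && remembered_count < REMEMBERED_MAX)
            remembered_set[remembered_count++] = holder;
    }

    int main(void) {
        object old_obj   = { .in_old_gen = true,  .field = NULL };
        object young_obj = { .in_old_gen = false, .field = NULL };
        write_barrier(&old_obj, &young_obj);   /* old -> young: gets remembered */
        return 0;
    }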
Boehm is an incremental generational GC, as cited here: http://en.wikipedia.org/wiki/Boehm_garbage_collector
http://www.memorymanagement.org/glossary/i.html#incremental.garbage.collection
Some tracing garbage collection algorithms can pause in the middle of a collection cycle while the mutator continues, without ending up with inconsistent data. Such collectors can operate incrementally and are suitable for use in an interactive system.

Primitive garbage collectors, once they start a collection cycle, must either finish the task, or abandon all their work so far. This is often an appropriate restriction, but is unacceptable when the system must guarantee response times; for example, in systems with a user interface and in real-time hardware control systems. Such systems might use incremental garbage collection so that the time-critical processing and the garbage collection can proceed effectively in parallel, without wasted effort.
http://www.memorymanagement.org/glossary/g.html#generational.garbage.collection
Generational garbage collection is tracing garbage collection that makes use of the generational hypothesis. Objects are gathered together in generations. New objects are allocated in the youngest or nursery generation, and promoted to older generations if they survive. Objects in older generations are condemned less frequently, saving CPU time.

It is typically rare for an object to refer to a younger object. Hence, objects in one generation typically have few references to objects in younger generations. This means that the scanning of old generations in the course of collecting younger generations can be done more efficiently by means of remembered sets.

In some purely functional languages (that is, without update), all references are backwards in time, in which case remembered sets are unnecessary.
The Boehm-Demers-Weiser has an incremental mode that you can enable by calling GC_enable_incremental. See http://www.hpl.hp.com/personal/Hans_Boehm/gc/gcinterface.html
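A minimal usage sketch (assuming the collector is installed; on most systems the header is <gc.h> and you link with -lgc):

    #include <stdio.h>
    #include <gc.h>              /* Boehm-Demers-Weiser collector */

    int main(void) {
        GC_INIT();               /* initialise the collector                      */
        GC_enable_incremental(); /* request incremental collection where the OS
                                    provides the required virtual-memory support  */

        for (int i = 0; i < 1000000; i++) {
            int *p = GC_MALLOC(sizeof *p);   /* collected automatically, no free() */
            *p = i;
        }

        printf("heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
        return 0;
    }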
I'm designing a high level language, and I want it to have the speed of C++ (it will use LLVM), but be safe and high level like C#. Garbage collection is slow, and new/delete is unsafe. I decided to try to use "region based memory management" (there are a few papers about it on the web, mostly for functional languages). The only "useful" language using it is Cyclone, but that also has GC. Basically, objects are allocated on a lexical stack, and are freed when the block closes. Objects can only refer to other objects in the same region or higher, to prevent dangling references. To make this more flexible, I added parallel regions that can be moved up and down the stack, and retained through loops. The type system would be able to verify assignments in most cases, but low overhead runtime checks would be necessary in some places.
Ex:
    region(A) {
        Foo#A x = new Foo();         //x is deleted when this region closes.
        region(B,C) while(x.Y) {
            Bar#B n = new Bar();
            n.D = x;                 //OK, n is in lower region than x.
            //x.D = n; would cause error: x is in higher region than n.
            n.DoSomething();
            Bar#C m = new Bar();
            //m.D = n; would cause error: m and n are parallel.
            if(m.Y)
                retain(C);           //On the next iteration, m is retained.
        }
    }
Does this seem practical? Would I need to add non-lexically scoped, reference counted regions? Would I need to add weak variables that can refer to any object, but with a check on region deletion? Can you think of any algorithms that would be hard to use with this system or that would leak?
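For concreteness, the lifetime rule being proposed is essentially the classic region/arena discipline. A bare-bones C sketch of that discipline (all names invented for illustration) looks like this: allocations are tagged to a region, and closing the region frees everything allocated in it at once.

    #include <stddef.h>
    #include <stdlib.h>

    /* Every allocation is owned by a region; closing the region frees all of
       its allocations in one pass.  There is no per-object free. */
    typedef struct chunk chunk;
    struct chunk {
        chunk      *next;
        max_align_t payload[];   /* keeps the payload suitably aligned */
    };

    typedef struct region {
        chunk *chunks;           /* list of allocations owned by this region */
    } region;

    static void *region_alloc(region *r, size_t size) {
        chunk *c = malloc(sizeof *c + size);
        if (c == NULL)
            return NULL;
        c->next = r->chunks;
        r->chunks = c;
        return c->payload;
    }

    /* Corresponds to the closing brace of region(A) { ... } in the question. */
    static void region_close(region *r) {
        chunk *c = r->chunks;
        while (c != NULL) {
            chunk *next = c->next;
            free(c);
            c = next;
        }
        r->chunks = NULL;
    }

    int main(void) {
        region a = { NULL };
        int *x = region_alloc(&a, sizeof *x);   /* like `Foo#A x = new Foo()` */
        *x = 42;
        region_close(&a);                       /* x and everything else in A die here */
        return 0;
    }

Production region allocators carve objects out of larger blocks rather than doing one malloc per allocation, but the lifetime semantics are the same, and they are what the type system in the question has to police.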
I would discourage you from trying regions. The problem is that in order to make regions guaranteed to be safe, you need a very sophisticated type system; I'm sure you've looked at the papers by Tofte and Talpin, and you have an idea of the complexities involved. Even if you do get regions working successfully, the chances are very high that your program will require a region whose lifetime is the lifetime of the whole program, and that region at least has to be garbage collected. (This is why Cyclone has both regions and GC.)
Since you're just getting started, I'd encourage you to go with garbage collection. Modern garbage collectors can be made pretty fast without a lot of effort. The main issue is to allocate from contiguous free space so that allocation is fast. It helps to be targeting AMD64 or other machine with spare registers so you can use a hardware register as the allocation pointer.
There are lots of good ideas to adapt; one of the easiest to implement is a page-based collector like Joel Bartlett's mostly-copying collector, where the idea is you allocate only from completely empty pages.
If you want to study existing garbage collectors, Lua has a fairly sophisticated incremental garbage collector (so there are no visible pause times) and the implementation is only 700 lines. It is fast enough to be used in a lot of games, where performance matters.
If I were implementing a language with region based memory management, I would probably read A language-independent framework for region inference. That said, it's been a while since I looked into this stuff, and I'm sure the state of the art has moved on, if I ever even knew what the state of the art was.
Well, you should go study Apple's memory management. It has release pools and zones, which sure sound a lot like what you're doing here.
I won't comment on the "GC is slow" remark. You can start with Tofte and Talpin's papers about region-based memory management.
How would it return a dynamically created object? Who would "own" it and be responsible for freeing the memory?
Refcounting or GC are so common because they are almost always the best choices. Generational garbage collectors can be very efficient.