Go garbage collector overhead with minimal allocations?

Go garbage collector overhead with minimal allocations? - go

It's widely accepted that one of the main things holding Go back from C++ level performance is the garbage collector. I'd like to get some intuition to help reason about the overhead of Go's GC in different circumstances. For example, is there nontrivial GC overhead if a program never touches the heap, or just allocates a large block at setup to use as an object pool with self-management? Is a call made to the GC every x seconds, or on every allocation?
As a related question: is my initial assumption correct that Go's GC is the main impediment to C++ level performance, or are there some things that Go just does slower?

The pause time (stop the world) for garbage collection in Golang is in the order of a few milliseconds, or in more recent Golang less than that. (see
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md)
C++ does not have a garbage collector so these times of pauses do not happen. However, C++ is not magic and memory management must occur if memory for storing objects is to be managed. Memory management is still happening somewhere in your program regardless of the language
Using a static block of memory in C++ and not dealing with any memory management issues is an approach. But Go can do this too. For an outline of how this is done in a real, high performance Go program, see this video
https://www.youtube.com/watch?time_continue=7&v=ZuQcbqYK0BY

Related

How could I make a Go program use more memory? Is that recommended?

I'm looking for option something similar to -Xmx in Java, that is to assign maximum runtime memory that my Go application can utilise. Was checking the runtime , but not entirely if that is the way to go.
I tried setting something like this with func SetMaxStack(), (likely very stupid)
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason why I am looking to do this is because currently there is ample amount of RAM available but the application won't consume more than 4-6% , I might be wrong here but it could be forcing GC to happen much faster than needed leading to performance issue.
What I'm doing
Getting large dataset from RDBMS system , processing it to write out in excel.
Another reason why I am looking for such an option is to limit the maximum usage of RAM on the server where it will be ultimately deployed.
Any hints on this would greatly appreciated.

The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
Having said that I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean, that normally there's nothing wrong with your program not consuming lots of RAM—if the collector happens to sweep the garbage fast enough without hampering the performance in a significant way, I see no reason to worry about: the Go's GC is
one of the points of the most intense fine-tuning in the runtime,
and works very good in fact.
Hence you may try to take another route:
Profile memory allocations of your program.
Analyze the profile and try to figure out where the hot spots
are, and whether (and how) they can be optimized.
You might start here
and continue with the gazillion other
intros to this stuff.
Optimize. Typically this amounts to making certain buffers
reusable across different calls to the same function(s)
consuming them, preallocating slices instead of growing them
gradually, using sync.Pool where deemed useful etc.
Such measures may actually increase the memory
truly used (that is, by live objects—as opposed to
garbage) but it may lower the pressure on the GC.

Is it possible to deallocate memory manually in go?

After discussion with college, I wonder if it would be possible (even if completely does not make any sense) to deallocate memory manually in go (ie. by using unsafe package). Is it?

Here is a thread that may interest you: Add runtime.Free() for GOGC=off
Interesting part:
The Go GC does not have the ability to manually deallocate blocks
anymore. And besides, runtime. Free is unsafe (people might free still
in use pointers or double free) and then all sorts of C memory problem
that Go tries hard to get rid of will come back. The other reason is
that runtime sometimes allocates behind your back and there is no way
for the program to explicitly free memory.
If you really want to manually manage memory with Go, implement your
own memory allocator based on syscall.Mmap or cgo malloc/free.
Disabling GC for extended period of time is generally a bad solution
for a concurrent language like Go. And Go's GC will only be better
down the road.
TL;DR: Yes, but don't do it

I am a bit late but this question is high ranked on google, so here is an article by the creator of DGraph database which explains an alternative to malloc/calloc which is jemalloc, worth a look
https://dgraph.io/blog/post/manual-memory-management-golang-jemalloc/
With these techniques, we get the best of both worlds: We can do manual memory allocation in critical, memory-bound code paths. At the same time, we can get the benefits of automatic garbage collection in non-critical code paths. Even if you are not comfortable using Cgo or jemalloc, you could apply these techniques on bigger chunks of Go memory, with similar impact.
And I haven't tested it yet, but there is a github library called jemalloc-go
https://github.com/spinlock/jemalloc-go

Go 1.20 introduces an experimental concept of arenas for memory management, per the proposal proposal: arena: new package providing memory arenas. We could manage memory manually through arenas.
We propose the addition of a new arena package to the Go standard library. The arena package will allow the allocation of any number of arenas. Objects of arbitrary type can be allocated from the memory of the arena, and an arena automatically grows in size as needed. When all objects in an arena are no longer in use, the arena can be explicitly freed to reclaim its memory efficiently without general garbage collection. We require that the implementation provide safety checks, such that, if an arena free operation is unsafe, the program will be terminated before any incorrect behavior happens.
Sample codes:
a := arena.New()
var ptrT *T
a.New(&ptrT)
ptrT.val = 1
var sliceT []T
a.NewSlice(&sliceT, 100)
sliceT[99] .val = 4
a.Free()
Example: Per Go 1.20 Experiment: Memory Arenas vs Traditional Memory Management from Pyroscope.
Arenas are a powerful tool for optimizing Go programs, particularly in scenarios where your programs spend significant amount of time parsing large protobuf or JSON blobs.
Some recommendations:
Only use arenas in critical code paths. Do not use them everywhere
Profile your code before and after using arenas to make sure you're adding arenas in areas where they can provide the most benefit
Pay close attention to the lifecycle of the objects created on the arena. - Make sure you don't leak them to other components of your program where objects may outlive the arena
Use defer a.Free() to make sure that you don't forget to free memory
Use arena.Clone() to clone objects back to the heap if you want to use them after an arena was freed
Note: This proposal is on hold indefinitely due to serious API concerns. The GOEXPERIMENT=arena code may be changed incompatibly or removed at any time, and we do not recommend its use in production.

What is Go's memory footprint

This article on Wired about Dropbox's switch from Go to Rust for its MagicPocket product says
“memory footprint”—the amount of computer memory it demands while running Magic Pocket—was too high for the massive storage systems the company was trying to build.
Question(s): What exactly is Go's "memory footprint" (where does it come from, how is it measured etc, is it related to garbage collection,binary size, is it something that will always be high) and why is it higher than Rust's?

"It had a high memory footprint" is just another way to say their program used a lot of RAM. It is related to garbage collection in that GC'd programs only free memory periodically (because each GC cycle takes CPU time), whereas manual memory management tends to free memory more or less as soon as it's unused.
The downside of manual memory management is either that mistakes can cause crashes and security bugs (as in C++, where you can accidentally use a freed variable after the memory has been reused for something else) or you have to put effort into expressing the exact lifetimes of each variable, reference, etc. in your code so that the compiler can check that they're being used in a valid way (as in Rust, where you interact with the borrow checker to root out potentially incorrect uses of memory in your code).
The sentence in the Wired story makes it sound like "memory footprint" is a simple measurable quantity you could assign to any language (and your question takes that idea to its logical conclusion). It's not quite that simple. In different languages, doing different things has different costs in memory, performance, and so on, and you kind of have to understand languages'/runtimes' details to know how the language will work with a given sort of program.
For example, CPython has reference counting, and that frees unused memory sooner but at the cost of having to store and update reference counts. Java has, on the one hand, things like object headers that add a certain amount of memory overhead per object, but uses some tricks to speed garbage collection (like generational collection) that Go doesn't (yet). Or in Go, you might try to reduce the memory footprint of a program by recycling memory with free pools and adjusting GOGC to free unused memory more often as kostya said.
The bigger point there is not that those specific details I listed are super important, but that there can be a lot of details to consider other than "higher memory footprint" or "lower memory footprint."
So: "memory footprint" refers to the amount of RAM a particular program with a particular workload takes up. Bigger picture, it's one factor in a large set of tradeoffs that folks like you or I or Dropbox's team have to navigate.

The garbage collector requires free memory to be available to work efficiently. By default Go application needs roughly twice as much memory as size of live data set (memory occupied by application objects).
This can be tuned using GOGC environment variable. By setting it to a lower value the application will request less memory from OS but GC will run more frequently therefore will use more CPU resources. By setting it to a higher value the GC will run less frequently and use less resources but the application will have higher "memory footprint".
This is general idea but the exact memory, performance requirements and GOGC effect are highly application specific.

Memory Allocation/Deallocation Bottleneck?

How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?
Note: I'm not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn't necessarily.
Edit: Although I mention malloc, this question is not intended to be C/C++ specific.

It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (eg, they ask the OS for memory 16MB at a time and then parcel it out in fixed blocks of 4kb, 16kb, etc) to avoid this issue.
In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.

Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.
In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention locks are far from free and should be avoided as much as possible.
Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.

First off, since you said malloc, I assume you're talking about C or C++.
Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on "under the hood" when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized--there's no platform-independent way way to know what the impact will be. Some systems (like a lot of game consoles) also don't do memory defragmentation, so on those systems, you'll start to get out-of-memory errors as memory becomes fragmented.
A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
No two ways about it, though--if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.

In general the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. In general, I'd say this is probably not in the top-10 of performance issues I'd worry about.
Now, grabbing very large chunks of memory might be an issue. And grabbing but not properly getting rid of memory is something I'd worry about.
In Java and JVM-based languages, new'ing objects is now very, very, very fast.
Here's one decent article by a guy who knows his stuff with some references at the bottom to more related links:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html

A Java VM will claim and release memory from the operating system pretty much indepdently of what the application code is doing. This allows it to grab and release memory in large chunks, which is hugely more efficient than doing it in tiny individual operations, as you get with manual memory management.
This article was written in 2005, and JVM-style memory management was already streets ahead. The situation has only improved since then.
Which language boasts faster raw
allocation performance, the Java
language, or C/C++? The answer may
surprise you -- allocation in modern
JVMs is far faster than the best
performing malloc implementations. The
common code path for new Object() in
HotSpot 1.4.2 and later is
approximately 10 machine instructions
(data provided by Sun; see Resources),
whereas the best performing malloc
implementations in C require on
average between 60 and 100
instructions per call (Detlefs, et.
al.; see Resources). And allocation
performance is not a trivial component
of overall performance -- benchmarks
show that many real-world C and C++
programs, such as Perl and
Ghostscript, spend 20 to 30 percent of
their total execution time in malloc
and free -- far more than the
allocation and garbage collection
overhead of a healthy Java
application.

In Java (and potentially other languages with a decent GC implementation) allocating an object is very cheap. In the SUN JVM it only needs 10 CPU Cycles. A malloc in C/c++ is much more expensive, just because it has to do more work.
Still even allocation objects in Java is very cheap, doing so for a lot of users of a web application in parallel can still lead to performance problems, because more Garbage Collector runs will be triggered.
Therefore there are those indirect costs of an allocation in Java caused by the deallocation done by the GC. These costs are difficult to quantify because they depend very much on your setup (how much memory do you have) and your application.

Allocating and releasing memory in terms of performance are relatively costly operations. The calls in modern operating systems have to go all the way down to the kernel so that the operating system is able to deal with virtual memory, paging/mapping, execution protection etc.
On the other side, almost all modern programming languages hide these operations behind "allocators" which work with pre-allocated buffers.
This concept is also used by most applications which have a focus on throughput.

I know I answered earlier, however, that was ananswer to the other answer's, not to your question.
To speak to you directly, if I understand correctly, your performance use case criteria is throughput.
This to me, means's that you should be looking almost exclusivly at NUMA aware allocators.
None of the earlier references; IBM JVM paper, Microquill C, SUN JVM. Cover this point so I am highly suspect of their application today, where, at least on the AMD ABI, NUMA is the pre-eminent memory-cpu governer.
Hands down; real world, fake world, whatever world... NUMA aware memory request/use technologies are faster. Unfortunately, I'm running Windows currently, and I have not found the "numastat" which is available in linux.
A friend of mine has written about this in depth in his implmentation for the FreeBSD kernel.
Dispite me being able to show at-hoc, the typically VERY large amount of local node memory requests on top of the remote node (underscoring the obvious performance throughput advantage), you can surly benchmark yourself, and that would likely be what you need todo as your performance charicterisitc is going to be highly specific.
I do know that in a lot of ways, at least earlier 5.x VMWARE faired rather poorly, at that time at least, for not taking advantage of NUMA, frequently demanding pages from the remote node. However, VM's are a very unique beast when it comes to memory compartmentailization or containerization.
One of the references I cited is to Microsoft's API implmentation for the AMD ABI, which has NUMA allocation specialized interfaces for user land application developers to exploit ;)
Here's a fairly recent analysis, visual and all, from some browser add-on developers who compare 4 different heap implmentations. Naturally the one they developed turns out on top (odd how the people who do the testing often exhibit the highest score's).
They do cover in some ways quantifiably, at least for their use case, what the exact trade off is between space/time, generally they had identified the LFH (oh ya and by the way LFH is simply a mode apparently of the standard heap) or similarly designed approach essentially consumes signifcantly more memory off the bat however over time, may wind up using less memory... the grafix are neat too...
I would think however that selecting a HEAP implmentation based on your typical workload after you well understand it ;) is a good idea, but to well understand your needs, first make sure your basic operations are correct before you optimize these odds and ends ;)

This is where c/c++'s memory allocation system works the best. The default allocation strategy is OK for most cases but it can be changed to suit whatever is needed. In GC systems there's not a lot you can do to change allocation strategies. Of course, there is a price to pay, and that's the need to track allocations and free them correctly. C++ takes this further and the allocation strategy can be specified per class using the new operator:
class AClass
{
public:
void *operator new (size_t size); // this will be called whenever there's a new AClass
void *operator new [] (size_t size); // this will be called whenever there's a new AClass []
void operator delete (void *memory); // if you define new, you really need to define delete as well
void operator delete [] (void *memory);define delete as well
};
Many of the STL templates allow you to define custom allocators as well.
As with all things to do with optimisation, you must first determine, through run time analysis, if memory allocation really is the bottleneck before writing your own allocators.

According to MicroQuill SmartHeap Technical Specification, "a typical application [...] spends 40% of its total execution time on managing memory". You can take this figure as an upper bound, i personally feel that a typical application spends more like 10-15% of execution time allocating/deallocating memory. It rarely is a bottleneck in single-threaded application.
In multithreaded C/C++ applications standard allocators become an issue due to lock contention. This is where you start to look for more scalable solutions. But keep in mind Amdahl's Law.

Pretty much all of you are off base if you are talking about the Microsoft heap. Syncronization is effortlessly handled as is fragmentation.
The current perferrred heap is the LFH, (LOW FRAGMENTATION HEAP), it is default in vista+ OS's and can be configured on XP, via gflag, with out much trouble
It is easy to avoid any locking/blocking/contention/bus-bandwitth issues and the lot with the
HEAP_NO_SERIALIZE
option during HeapAlloc or HeapCreate. This will allow you to create/use a heap without entering into an interlocked wait.
I would reccomend creating several heaps, with HeapCreate, and defining a macro, perhaps, mallocx(enum my_heaps_set, size_t);
would be fine, of course, you need realloc, free also to be setup as appropiate. If you want to get fancy, make free/realloc auto-detect which heap handle on it's own by evaluating the address of the pointer, or even adding some logic to allow malloc to identify which heap to use based on it's thread id, and building a heierarchy of per-thread heaps and shared global heap's/pools.
The Heap* api's are called internally by malloc/new.
Here's a nice article on some dynamic memory management issues, with some even nicer references. To instrument and analyze heap activity.

Others have covered C/C++ so I'll just add a little information on .NET.
In .NET heap allocation is generally really fast, as it it just a matter of just grabbing the memory in the generation zero part of the heap. Obviously this cannot go on forever, which is where garbage collection comes in. Garbage collection may affect the performance of your application significantly since user threads must be suspended during compaction of memory. The fewer full collects, the better.
There are various things you can do to affect the workload of the garbage collector in .NET. Generally if you have a lot of memory reference the garbage collector will have to do more work. E.g. by implementing a graph using an adjacency matrix instead of references between nodes the garbage collector will have to analyze fewer references.
Whether that is actually significant in your application or not depends on several factors and you should profile the application with actual data before turning to such optimizations.

Region based memory management

I'm designing a high level language, and I want it to have the speed of C++ (it will use LLVM), but be safe and high level like C#. Garbage collection is slow, and new/delete is unsafe. I decided to try to use "region based memory management" (there are a few papers about it on the web, mostly for functional languages). The only "useful" language using it is Cyclone, but that also has GC. Basically, objects are allocated on a lexical stack, and are freed when the block closes. Objects can only refer to other objects in the same region or higher, to prevent dangling references. To make this more flexible, I added parallel regions that can be moved up and down the stack, and retained through loops. The type system would be able to verify assignments in most cases, but low overhead runtime checks would be necessary in some places.
Ex:
region(A) {
Foo#A x=new Foo(); //x is deleted when this region closes.
region(B,C) while(x.Y) {
Bar#B n=new Bar();
n.D=x; //OK, n is in lower region than x.
//x.D=n; would cause error: x is in higher region than n.
n.DoSomething();
Bar#C m=new Bar();
//m.D=n; would cause error: m and n are parallel.
if(m.Y)
retain(C); //On the next iteration, m is retained.
}
}
Does this seem practical? Would I need to add non-lexically scoped, reference counted regions? Would I need to add weak variables that can refer to any object, but with a check on region deletion? Can you think of any algorithms that would be hard to use with this system or that would leak?

I would discourage you from trying regions. The problem is that in order to make regions guaranteed to be safe, you need a very sophisticated type system---I'm sure you've looked at the papers by Tofte and Talpin and you have an idea of the complexities involved. Even if you do get regions working successfully, the chances are very hight that your program will require a whose lifetime is the lifetime of the program---and that region at least has to be garbage collected. (This is why Cyclone has regions and GC.)
Since you're just getting started, I'd encourage you to go with garbage collection. Modern garbage collectors can be made pretty fast without a lot of effort. The main issue is to allocate from contiguous free space so that allocation is fast. It helps to be targeting AMD64 or other machine with spare registers so you can use a hardware register as the allocation pointer.
There are lots of good ideas to adapt; one of the easiest to implement is a page-based collector like Joel Bartlett's mostly-copying collector, where the idea is you allocate only from completely empty pages.
If you want to study existing garbage collectors, Lua has a fairly sophisticated incremental garbage collector (so there are no visible pause times) and the implementation is only 700 lines. It is fast enough to be used in a lot of games, where performance matters.

If I were implementing a language with region based memory management, I would probably read A language-independent framework for region inference. That said, it's been a while since I looked into this stuff, and I'm sure the state of the art has moved on, if I ever even knew what the state of the art was.

Well you should go study Apples memory management. It has release pools and zones, which sure sound a lot like what you're doing here.
I won't comment on the "GC is slow" remark,

You can start by Tofte and Talpin's papers about region-based memory management.

How would it return a dynamically created object? Who would "own" it and be responsible for freeing the memory?
Refcounting or GC are so common because they are almost always the best choices. Generational garbage collectors can be very efficient.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio