Is there any performance penalty building a Go program using the -race flag?

Hi, I was wondering if there is any performance penalty when running a Go program in production that was built using
go build -race

You can read about it in the article that describes the Go race detector at https://go.dev/doc/articles/race_detector
Quoting from that article:
Runtime Overhead
The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by
2-20x.
The race detector currently allocates an extra 8 bytes per defer and
recover statement. Those extra allocations are not recovered until the
goroutine exits. This means that if you have a long-running goroutine
that is periodically issuing defer and recover calls, the program
memory usage may grow without bound. These memory allocations will not
show up in the output of runtime.ReadMemStats or runtime/pprof.
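If you want to quantify that overhead on your own workload before deciding, one simple approach is to run the same build and benchmarks with and without the flag. A hedged sketch (the package paths and benchmark selection are placeholders):
go build -race ./...                    # race-enabled binaries for staging or canary use
go test -race ./...                     # run the test suite under the detector
go test -run='^$' -bench=. ./...        # baseline benchmarks
go test -run='^$' -bench=. -race ./...  # the same benchmarks under the detector, for comparison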

Related

Is it necessary to limit the number of goroutines in an entirely CPU-bound workload?

If yes, how does one determine that maximum? That is the most important part to me. I'd really like to be able to set it manually. I considered using runtime.GOMAXPROCS(0), as I doubt that more parallelism will yield any additional benefit. The comment seems to suggest that it is marked for deprecation at some point.
From what I gather, the only limiting factor when it comes to goroutines is memory, as a sleeping goroutine still requires memory for its stack.
It's not strictly necessary. The number of threads running these goroutines is by default equal to the number of CPU cores on the machine (configurable through GOMAXPROCS), so there will be no contention at the thread level.
However, you might get performance benefits from having fewer goroutines ready to run, because of memory caching effects. For example, on an 8-core machine, if you have 1000 active goroutines that all touch significant amounts of memory, by the time a goroutine gets to run again, the needed memory pages have probably already been evicted from your CPU caches. With fewer goroutines, the odds of a cache hit are better.
As always with performance questions: the only way to be sure is to measure it yourself with a representative workload.
In our testing, we determined that it is best to spawn a fixed number of worker routines and use those to perform all the work. The creation and destruction of goroutines is lightweight, but not entirely free of overhead. That overhead is usually insignificant if the goroutines spend any amount of time blocked.
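A minimal sketch of that fixed worker-pool pattern (the worker count and the squaring "work" are placeholders for your real CPU-bound job):
package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    jobs := make(chan int)
    results := make(chan int)

    var wg sync.WaitGroup
    // One worker per CPU: for purely CPU-bound work, more goroutines
    // than cores mostly adds scheduling and cache pressure.
    for w := 0; w < runtime.GOMAXPROCS(0); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                results <- j * j // placeholder for the real work
            }
        }()
    }

    // Close results once every worker has finished.
    go func() {
        wg.Wait()
        close(results)
    }()

    // Feed the pool, then signal there is no more work.
    go func() {
        for i := 0; i < 100; i++ {
            jobs <- i
        }
        close(jobs)
    }()

    sum := 0
    for r := range results {
        sum += r
    }
    fmt.Println("sum of squares:", sum)
}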
Goroutines are very lightweight, so it depends entirely on the system you are running on. An average process should have no problem with fewer than a million concurrent goroutines in 4 GB of RAM. Whether this holds for your target platform is, of course, something we can't answer without knowing what that platform is.

Go garbage collector overhead with minimal allocations?

It's widely accepted that one of the main things holding Go back from C++ level performance is the garbage collector. I'd like to get some intuition to help reason about the overhead of Go's GC in different circumstances. For example, is there nontrivial GC overhead if a program never touches the heap, or just allocates a large block at setup to use as an object pool with self-management? Is a call made to the GC every x seconds, or on every allocation?
As a related question: is my initial assumption correct that Go's GC is the main impediment to C++ level performance, or are there some things that Go just does slower?
The stop-the-world pause time for garbage collection in Go is on the order of a few milliseconds, and in more recent Go versions even less than that. (see
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md)
C++ does not have a garbage collector, so these kinds of pauses do not happen. However, C++ is not magic, and memory management must occur if memory for storing objects is to be managed. Memory management is still happening somewhere in your program regardless of the language.
Using a static block of memory in C++ and not dealing with any memory management issues is one approach. But Go can do this too. For an outline of how this is done in a real, high-performance Go program, see this video:
https://www.youtube.com/watch?time_continue=7&v=ZuQcbqYK0BY
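As a rough illustration of the "allocate up front and reuse" idea in Go, the standard library's sync.Pool lets steady-state code reuse buffers instead of allocating new ones, which keeps GC work proportional to live data. A minimal sketch (the buffer size and usage are arbitrary):
package main

import (
    "fmt"
    "sync"
)

// bufPool hands out reusable 64 KB scratch buffers.
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 64<<10) },
}

func handleRequest(n int) int {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // return the buffer for reuse instead of leaving it as garbage
    // ... use buf as scratch space ...
    return n + len(buf)
}

func main() {
    fmt.Println(handleRequest(1))
}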

Why are goroutines much cheaper than threads in other languages?

In his talk - https://blog.golang.org/concurrency-is-not-parallelism, Rob Pike says that goroutines are similar to threads but much, much cheaper. Can someone explain why?
See "How goroutines work".
They are cheaper in:
Memory consumption: a thread starts with a large fixed stack, as opposed to a few KB for a goroutine.
Setup and teardown costs: that is why you have to maintain a pool of threads.
Switching costs: threads are scheduled preemptively, and during a thread switch the scheduler needs to save/restore ALL registers. In Go the runtime manages goroutines throughout, from creation to scheduling to teardown, and the number of registers to save is lower.
Plus, as mentioned in "Go’s march to low-latency GC", a GC is easier to implement when the runtime is in charge of managing goroutines:
Since the introduction of its concurrent GC in Go 1.5, the runtime has kept track of whether a goroutine has executed since its stack was last scanned. The mark termination phase would check each goroutine to see whether it had recently run, and would rescan the few that had.
In Go 1.7, the runtime maintains a separate short list of such goroutines. This removes the need to look through the entire list of goroutines while user code is paused, and greatly reduces the number of memory accesses that can trigger the kernel’s NUMA-related memory migration code.
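A rough, back-of-the-envelope way to see the memory side of this yourself is to park a large number of goroutines and compare runtime.MemStats before and after. This is an illustration, not a proper benchmark:
package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    const n = 100000
    var before, after runtime.MemStats
    var wg sync.WaitGroup
    block := make(chan struct{})

    runtime.GC()
    runtime.ReadMemStats(&before)

    wg.Add(n)
    for i := 0; i < n; i++ {
        go func() {
            wg.Done()
            <-block // park the goroutine
        }()
    }
    wg.Wait() // all goroutines have started

    runtime.GC()
    runtime.ReadMemStats(&after)
    fmt.Printf("~%d bytes per goroutine\n", (after.Sys-before.Sys)/n)
    close(block)
}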

How to analyze golang memory?

I wrote a golang program that uses 1.2GB of memory at runtime.
Calling go tool pprof http://10.10.58.118:8601/debug/pprof/heap results in a dump with only 323.4MB heap usage.
What's about the rest of the memory usage?
Is there any better tool to explain golang runtime memory?
Using gcvis, and from the heap profile, I see the following (graphs omitted).
Here is my code: https://github.com/sharewind/push-server/blob/v3/broker
The heap profile shows active memory, memory the runtime believes is in use by the Go program (i.e. hasn't been collected by the garbage collector). When the GC does collect memory the profile shrinks, but no memory is returned to the system. Your future allocations will try to use memory from the pool of previously collected objects before asking the system for more.
From the outside, this means that your program's memory use will either be increasing or staying level. What the outside system presents as the "Resident Size" of your program is the number of bytes of RAM assigned to your program, whether it's holding in-use Go values or collected ones.
The reasons why these two numbers are often quite different are:
The GC collecting memory has no effect on the outside view of the program
Memory fragmentation
The GC only runs when the memory in use doubles the memory in use after the previous GC (by default, see: http://golang.org/pkg/runtime/#pkg-overview)
If you want an accurate breakdown of how Go sees the memory you can use the runtime.ReadMemStats call: http://golang.org/pkg/runtime/#ReadMemStats
Alternatively, since you are using web-based profiling, you can access the profiling data through your browser at http://10.10.58.118:8601/debug/pprof/ ; clicking the heap link will show you the debugging view of the heap profile, which has a printout of a runtime.MemStats structure at the bottom.
The runtime.MemStats documentation (http://golang.org/pkg/runtime/#MemStats) has the explanation of all the fields, but the interesting ones for this discussion are:
HeapAlloc: essentially what the profiler is giving you (active heap memory)
Alloc: similar to HeapAlloc, but for all Go-managed memory
Sys: the total amount of memory (address space) requested from the OS
There will still be discrepancies between Sys and what the OS reports, because what Go asks of the system and what the OS gives it are not always the same. Also, CGO / syscall (e.g. malloc / mmap) memory is not tracked by Go.
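For example, a minimal snippet that prints the fields discussed above:
package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    // HeapAlloc: live heap memory, roughly what the heap profile reports
    // Alloc:     like HeapAlloc, for all Go-managed memory
    // Sys:       total address space requested from the OS
    fmt.Printf("HeapAlloc = %d MiB, Alloc = %d MiB, Sys = %d MiB\n",
        m.HeapAlloc>>20, m.Alloc>>20, m.Sys>>20)
}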
As an addition to @Cookie of Nine's answer, in short: you can try the --alloc_space option.
go tool pprof uses --inuse_space by default. It samples memory usage, so the result is a subset of the real one.
With --alloc_space, pprof reports all memory allocated since the program started.
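For example, against the heap endpoint from the question:
go tool pprof --alloc_space http://10.10.58.118:8601/debug/pprof/heap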
UPD (2022)
For those who know Russian, I made a presentation and wrote a couple of articles on this topic:
RAM consumption in Golang: problems and solutions (Потребление оперативной памяти в языке Go: проблемы и пути решения)
Preventing Memory Leaks in Go, Part 1. Business Logic Errors (Предотвращаем утечки памяти в Go, ч. 1. Ошибки бизнес-логики)
Preventing memory leaks in Go, part 2. Runtime features (Предотвращаем утечки памяти в Go, ч. 2. Особенности рантайма)
Original answer (2017)
I was always confused about the growing resident memory of my Go applications, and finally I had to learn the profiling tools that are present in the Go ecosystem. The runtime provides many metrics within a runtime.MemStats structure, but it may be hard to understand which of them can help find the reasons for memory growth, so some additional tools are needed.
Profiling environment
Use https://github.com/tevjef/go-runtime-metrics in your application. For instance, you can put this in your main:
import (
    "time"

    metrics "github.com/tevjef/go-runtime-metrics"
)

func main() {
    // ...
    // Collect runtime metrics once per second and ship them to InfluxDB.
    metrics.DefaultConfig.CollectionInterval = time.Second
    if err := metrics.RunCollector(metrics.DefaultConfig); err != nil {
        // handle error
    }
}
Run InfluxDB and Grafana within Docker containers:
docker run --name influxdb -d -p 8086:8086 influxdb
docker run -d -p 9090:3000/tcp --link influxdb --name=grafana grafana/grafana:4.1.0
Set up the interaction between Grafana and InfluxDB (Grafana main page -> Top left corner -> Data Sources -> Add new data source):
Import dashboard #3242 from https://grafana.com (Grafana main page -> Top left corner -> Dashboard -> Import):
Finally, launch your application: it will transmit runtime metrics to the containerized InfluxDB. Put your application under a reasonable load (in my case it was quite small - 5 RPS for several hours).
Memory consumption analysis
The Sys curve (close to what the OS reports as RSS) is quite similar to the HeapSys curve. It turns out that dynamic memory allocation was the main factor of overall memory growth, so the small amount of memory consumed by stack variables seems to be constant and can be ignored;
The constant number of goroutines guarantees the absence of goroutine leaks / stack variable leaks;
The total number of allocated objects remains the same (there is no point in taking the fluctuations into account) during the lifetime of the process.
The most surprising fact: HeapIdle grows at the same rate as Sys, while HeapReleased is always zero. Obviously the runtime doesn't return memory to the OS at all, at least under the conditions of this test:
HeapIdle minus HeapReleased estimates the amount of memory
that could be returned to the OS, but is being retained by
the runtime so it can grow the heap without requesting more
memory from the OS.
For those who are trying to investigate memory consumption problems, I would recommend following the described steps in order to exclude trivial errors (like a goroutine leak).
Freeing memory explicitly
It's interesting that one can significantly decrease memory consumption with explicit calls to debug.FreeOSMemory():
// in the top-level package
import (
    "runtime/debug"
    "time"
)

func init() {
    go func() {
        t := time.Tick(time.Second)
        for {
            <-t
            // Ask the runtime to return as much memory to the OS as possible.
            debug.FreeOSMemory()
        }
    }()
}
In fact, this approach saved about 35% of memory as compared with default conditions.
You can also use StackImpact, which automatically records and reports regular and anomaly-triggered memory allocation profiles to the dashboard, where they are available in a historical and comparable form. See this blog post for more details: Memory Leak Detection in Production Go Applications.
Disclaimer: I work for StackImpact
Attempting to answer the following original question
Is there any better tool to explain golang runtime memory?
I find the following tools useful
Statsview
https://github.com/go-echarts/statsview
Statsview integrates with the standard net/http/pprof.
Statsviz
https://github.com/arl/statsviz
This article should be quite helpful for your problem:
https://medium.com/safetycultureengineering/analyzing-and-improving-memory-usage-in-go-46be8c3be0a8
I ran a pprof analysis. pprof is a tool that’s baked into the Go language that allows for analysis and visualisation of profiling data collected from a running application. It’s a very helpful tool that collects data from a running Go application and is a great starting point for performance analysis. I’d recommend running pprof in production so you get a realistic sample of what your customers are doing.
When you run pprof you’ll get some files that focus on goroutines, CPU, memory usage and some other things according to your configuration. We’re going to focus on the heap file to dig into memory and GC stats. I like to view pprof in the browser because I find it easier to find actionable data points. You can do that with the below command.
go tool pprof -http=:8080 profile_name-heap.pb.gz
pprof has a CLI tool as well, but I prefer the browser option because I find it easier to navigate. My personal recommendation is to use the flame graph. I find that it’s the easiest visualiser to make sense of, so I use that view most of the time. The flame graph is a visual version of a function’s stack trace. The function at the top is the called function, and everything underneath it is called during the execution of that function. You can click on individual function calls to zoom in on them which changes the view. This lets you dig deeper into the execution of a specific function, which is really helpful. Note that the flame graph shows the functions that consume the most resources so some functions won’t be there. This makes it easier to figure out where the biggest bottlenecks are.
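For reference, the /debug/pprof/ endpoints that these commands read from are usually exposed with the standard net/http/pprof package. A minimal sketch (the port is arbitrary):
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // The heap profile is then available at http://localhost:6060/debug/pprof/heap
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}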
Is this helpful?
Try the Go plugin for Tracy. Tracy is "A real time, nanosecond resolution, remote telemetry" tool (...).
GoTracy (the name of the plugin) is the agent that connects to Tracy and sends the information needed to better understand your app's behaviour. After importing the plugin you can add telemetry code like in the example below:
func exampleFunction() {
    gotracy.TracyInit()
    gotracy.TracySetThreadName("exampleFunction")
    for i := 0.0; i < math.Pi; i += 0.1 {
        zoneid := gotracy.TracyZoneBegin("Calculating Sin(x) Zone", 0xF0F0F0)
        gotracy.TracyFrameMarkStart("Calculating sin(x)")
        sin := math.Sin(i)
        gotracy.TracyFrameMarkEnd("Calculating sin(x)")
        gotracy.TracyMessageLC("Sin(x) = "+strconv.FormatFloat(sin, 'E', -1, 64), 0xFF0F0F)
        gotracy.TracyPlotDouble("sin(x)", sin)
        gotracy.TracyZoneEnd(zoneid)
        gotracy.TracyFrameMark()
    }
}
The result is similar to: (screenshot omitted)
The plugin can be found at:
https://github.com/grzesl/gotracy
Tracy itself can be found at:
https://github.com/wolfpld/tracy

Is it possible to force a goroutine to be run on a specific CPU?

I am reading about the Go package "runtime" and see that I can, among other things (func GOMAXPROCS(n int)), set the number of CPUs that can be used to run my program. Can I force a goroutine to be run on a specific CPU of my choice?
In modern Go, I wouldn't lock goroutines to threads for efficiency. Go 1.5 added goroutine scheduling affinity, to minimize how often goroutines switch between OS threads. And any cost of the remaining migrations between CPUs has to be weighed against the benefit of the user-mode scheduler avoiding context switches into kernel mode. Finally, when switching costs are a real problem, sometimes a better focus is changing your program logic so it needs to switch less, like by communicating batches of work instead of individual work items.
But even considering all that, sometimes you simply have to lock a goroutine, like when a C API requires it, and I'll assume that's the case below.
If the whole program runs with GOMAXPROCS=1, then it's relatively simple to set a CPU affinity by calling out to the taskset utility from the schedutils package.
I had thought you were out of luck if GOMAXPROCS > 1 because then goroutines are migrated between OS threads at runtime. In fact, James Henstridge points out you can use runtime.LockOSThread() to keep your goroutine from migrating.
That doesn't solve locking the OS thread to a CPU. @yerden points out in a comment that the SchedSetaffinity function in the golang.org/x/sys/unix package, using 0 as the pid, ought to set the CPU affinity of the calling thread.
In the "C API requires locking" use case, it might also work to call pthread_setaffinity_np from C code.
I haven't tested either of those ways to lock threads to CPUs, and details will vary by OS there.
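Putting those two pieces together, a Linux-only, untested sketch (the CPU number is arbitrary):
package main

import (
    "log"
    "runtime"

    "golang.org/x/sys/unix"
)

// pinToCPU keeps the calling goroutine on its current OS thread and
// restricts that thread to a single CPU.
func pinToCPU(cpu int) error {
    runtime.LockOSThread() // don't let the scheduler move this goroutine to another thread
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    return unix.SchedSetaffinity(0, &set) // pid 0 means the calling thread
}

func main() {
    if err := pinToCPU(2); err != nil {
        log.Fatal(err)
    }
    // ... work that should stay on CPU 2 ...
}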
Depends on your workload, but sometimes it's beneficial to start a Go process per CPU, set GOMAXPROCS to 1, and pin each process to a CPU with taskset. Here is an excerpt on that topic from the awesome fasthttp library:
Use a reuseport listener.
Run a separate server instance per CPU core with GOMAXPROCS=1.
Pin each server instance to a separate CPU core using taskset.
Ensure the interrupts of a multiqueue network card are evenly distributed between CPU cores. See this article for details.
Use Go 1.6 as it provides some considerable performance improvements.
Source: https://github.com/valyala/fasthttp#performance-optimization-tips-for-multi-core-systems
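A hedged command-line sketch of those tips (the binary name, address flags, and core numbers are placeholders):
GOMAXPROCS=1 taskset -c 0 ./myserver -addr :8080 &
GOMAXPROCS=1 taskset -c 1 ./myserver -addr :8081 &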
