How to profile benchmarks using the pprof tool?

I want to profile my benchmarks generated by go test -c, but the go tool pprof needs a profile file usually generated inside the main function like this:
func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()
    }
    ...
}
How can I create a profile file within my benchmarks?

As described in https://pkg.go.dev/cmd/go#hdr-Testing_flags, you can specify the profile file using the -cpuprofile flag.
For example:
go test -cpuprofile cpu.out

Use the -cpuprofile flag to go test as documented at http://golang.org/cmd/go/#hdr-Description_of_testing_flags

This post explains how to profile benchmarks with an example: Benchmark Profiling with pprof.
The following benchmark simulates some CPU work.
package main

import (
    "math/rand"
    "testing"
)

func BenchmarkRand(b *testing.B) {
    for n := 0; n < b.N; n++ {
        rand.Int63()
    }
}
To generate a CPU profile for the benchmark test, run:
go test -bench=BenchmarkRand -benchmem -cpuprofile profile.out
The -memprofile and -blockprofile flags can be used to generate memory allocation and blocking call profiles.
To analyze the profile use the Go tool:
go tool pprof profile.out
(pprof) top
Showing nodes accounting for 1.16s, 100% of 1.16s total
Showing top 10 nodes out of 22
      flat  flat%   sum%        cum   cum%
     0.41s 35.34% 35.34%      0.41s 35.34%  sync.(*Mutex).Unlock
     0.37s 31.90% 67.24%      0.37s 31.90%  sync.(*Mutex).Lock
     0.12s 10.34% 77.59%      1.03s 88.79%  math/rand.(*lockedSource).Int63
     0.08s  6.90% 84.48%      0.08s  6.90%  math/rand.(*rngSource).Uint64 (inline)
     0.06s  5.17% 89.66%      1.11s 95.69%  math/rand.Int63
     0.05s  4.31% 93.97%      0.13s 11.21%  math/rand.(*rngSource).Int63
     0.04s  3.45% 97.41%      1.15s 99.14%  benchtest.BenchmarkRand
     0.02s  1.72% 99.14%      1.05s 90.52%  math/rand.(*Rand).Int63
     0.01s  0.86%   100%      0.01s  0.86%  runtime.futex
         0     0%   100%      0.01s  0.86%  runtime.allocm
The bottleneck in this case is the mutex, caused by the default source in math/rand being synchronized.
Other profile presentations and output formats are also possible, e.g. tree. Type help for more options.
Note that any initialization code run before the benchmark loop will also be profiled.

Related

log.SetFlags(log.LstdFlags | log.Lshortfile) in production

Is it good practice (or at least common practice) to set log.SetFlags(log.LstdFlags | log.Lshortfile) in production in Go? I wonder whether there is a performance or security issue in doing so, since it is not the default setting of the log package. I still can't find any official reference, or even an opinion piece, on the matter.
As for the performance: yes, it has an impact. However, it is in my opinion negligible, for various reasons.
Testing
Code
package main

import (
    "io/ioutil"
    "log"
    "testing"
)

func BenchmarkStdLog(b *testing.B) {
    // We do not want to benchmark the shell
    stdlog := log.New(ioutil.Discard, "", log.LstdFlags)
    for i := 0; i < b.N; i++ {
        stdlog.Println("foo")
    }
}

func BenchmarkShortfile(b *testing.B) {
    slog := log.New(ioutil.Discard, "", log.LstdFlags|log.Lshortfile)
    for i := 0; i < b.N; i++ {
        slog.Println("foo")
    }
}
Result
goos: darwin
goarch: amd64
pkg: stackoverflow.com/go/logbench
BenchmarkStdLog-4      3803840     277 ns/op      4 B/op    1 allocs/op
BenchmarkShortfile-4   1000000    1008 ns/op    224 B/op    3 allocs/op
Your mileage may vary, but the order of magnitude should be roughly equal.
Why I think the impact is negligible
It is unlikely that logging will be the bottleneck of your application unless you write an enormous volume of logs; 99 times out of 100, the bottleneck lies elsewhere.
Get your application up and running, load test and profile it. You can still optimize then.
Hint: make sure you can scale out.

runtime._ExternalCode CPU usage is too high, up to 80%

I wrote a TCP handler in Go that serves about 300 connections per second. There was no problem when the program was first released to production, but after running for about 10 days I saw CPU usage climb to 100%. I used "go tool pprof" to get the following CPU usage information:
File: gateway-w
Type: cpu
Time: Nov 7, 2018 at 5:38pm (CST)
Duration: 30.14s, Total samples = 30.13s ( 100%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 27.42s, 91.01% of 30.13s total
Dropped 95 nodes (cum <= 0.15s)
Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
    24.69s 81.94% 81.94%     24.69s 81.94%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
     0.57s  1.89% 83.84%      0.57s  1.89%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
     0.56s  1.86% 85.70%      0.56s  1.86%  runtime.unlock /usr/local/go/src/runtime/lock_futex.go
     0.26s  0.86% 86.56%      5.37s 17.82%  gateway-w/connect/connect-tcp.tcpStartSession /go/src/gateway-w/connect/connect-tcp/tcp_framework.go
     0.25s  0.83% 87.39%      1.67s  5.54%  net.(*conn).Read /usr/local/go/src/net/net.go
     0.24s  0.80% 88.18%      1.41s  4.68%  net.(*netFD).Read /usr/local/go/src/net/fd_unix.go
     0.23s  0.76% 88.95%      0.23s  0.76%  runtime.nanotime /usr/local/go/src/runtime/sys_linux_amd64.s
     0.22s  0.73% 89.68%      0.22s  0.73%  internal/poll.(*fdMutex).incref /usr/local/go/src/internal/poll/fd_mutex.go
     0.21s  0.70% 90.38%      0.21s  0.70%  internal/poll.(*fdMutex).rwunlock /usr/local/go/src/internal/poll/fd_mutex.go
     0.19s  0.63% 91.01%      0.19s  0.63%  internal/poll.(*fdMutex).rwlock /usr/local/go/src/internal/poll/fd_mutex.go
My TCP handler code looks like this:
func tcpStartSession(conn net.Conn) {
    defer closeTcp(conn)
    var (
        last int
        n    int
        err  error
    )
    buff := make([]byte, MAX_PACKET_LEN)
    for {
        // set read timeout
        conn.SetReadDeadline(time.Now().Add(time.Duration(tcpTimeOutSec) * time.Second))
        n, err = conn.Read(buff[last:])
        if err != nil {
            log.Info("tcp read error maybe timeout , ", err)
            break
        }
        if n == 0 {
            log.Debug("empty packet, continue")
            continue
        }
        log.Debug("read bytes ", n)
        log.Info("get a raw package:", hex.EncodeToString(buff[:last+n]))
        last += n
        ...
        for {
            if last == 0 {
                break
            }
            ret, err := protoHandle.IsWhole(buff[:last])
            if err != nil {
                log.Warn("proto handle check iswhole error", err)
            }
            log.Debug("rest buffer len = %d\n", ret)
            if ret < 0 {
                // wait for more tcp fragments
                break
            }
            packetLen := last - ret
            packetBuf := make([]byte, packetLen)
            copy(packetBuf, buff[:packetLen])
            last = ret
            if last > 0 {
                copy(buff, buff[packetLen:packetLen+last])
            }
            ...
        }
    }
}
I can't understand what runtime._ExternalCode means; it is a function inside the Go runtime.
My Go version is go1.9.2 linux/amd64.
My program runs in Docker; the Docker version is 1.12.6.
I hope someone can help me. Thank you very much!
I tried upgrading Go to 1.10.3. After running for more than half a year there was no problem, but recently the same problem occurred even though I have not changed the program code. I suspect there is a problem with this line:
conn.SetReadDeadline(time.Now().Add(time.Duration(tcpTimeOutSec) * time.Second))
Need your help, thank you.
As you have confirmed that your program was not built with CGO_ENABLED=0, the problem is (probably) in the cgo parts of the program: pprof can't profile inside sections of C libraries.
I believe some other things also count as "external" code, such as time.Now on some systems.

Measuring time in Go Routines

Measuring time around a function is easy in Go.
But what if you need to measure it 5000 times per second in parallel?
I'm referring to Correctly measure time duration in Go which contains great answers about how to measure time in Go.
What is the cost of using time.Now() 5000 times per second or more?
While it may depend on the underlying OS, let's consider Linux.
Time measurement depends on the programming language and its implementation, the operating system and its implementation, the hardware architecture, implementation, and speed, and so on.
You need to focus on facts, not speculation. In Go, start with some benchmarks. For example,
since_test.go:
package main

import (
    "testing"
    "time"
)

var now time.Time

func BenchmarkNow(b *testing.B) {
    for N := 0; N < b.N; N++ {
        now = time.Now()
    }
}

var since time.Duration
var start time.Time

func BenchmarkSince(b *testing.B) {
    for N := 0; N < b.N; N++ {
        start = time.Now()
        since = time.Since(start)
    }
}
Output:
$ go test since_test.go -bench=. -benchtime=1s
goos: linux
goarch: amd64
BenchmarkNow-4     30000000    47.5 ns/op
BenchmarkSince-4   20000000    98.1 ns/op
PASS
ok command-line-arguments 3.536s
$ go version
go version devel +48c4eeeed7 Sun Mar 25 08:33:21 2018 +0000 linux/amd64
$ uname -srvio
Linux 4.13.0-37-generic #42-Ubuntu SMP Wed Mar 7 14:13:23 UTC 2018 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
$
Now, ask yourself if 5,000 times per second is necessary, practical, and reasonable.
What are your benchmark results?

Go CPU profile is lacking function call information

I have been trying to dive in to Go (golang) performance analysis, based on articles like https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
However, in the actual profiled programs, the generated CPU profiles contain very little information: the go tool either reports that the profile is empty or shows no information about any function calls. This happens on both OS X and Linux.
I generated a minimal example of this situation - I am gathering the profile in a similar manner and facing the same issues in actual programs, too.
Here's the source code for miniprofile/main.go:
package main

import (
    "fmt"
    "os"
    "runtime/pprof"
)

func do_something(prev string, limit int) {
    if len(prev) < limit {
        do_something(prev+"a", limit)
    }
}

func main() {
    f, err := os.Create("./prof")
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()
    do_something("", 100000)
}
I am expecting to see a CPU profile that tells that almost all the time has been spent on different recursive calls to do_something.
However, this is what happens (the minimal app above is called miniprofile), and it is not very useful:
$ go version
go version go1.6.2 darwin/amd64
$ go install .
$ miniprofile
$ go tool pprof --text prof
1.91s of 1.91s total (  100%)
      flat  flat%   sum%        cum   cum%
     1.91s   100%   100%      1.91s   100%
Am I doing something in a horribly wrong way?
You're missing the binary argument to pprof:
go tool pprof --text miniprofile prof

How do I find out which of 2 methods is faster?

I have 2 methods to trim the domain suffix from a subdomain and I'd like to find out which one is faster. How do I do that?
2 string trimming methods
You can use the built-in benchmark capabilities of go test.
For example (on play):
package main

import (
    "strings"
    "testing"
)

func BenchmarkStrip1(b *testing.B) {
    for br := 0; br < b.N; br++ {
        host := "subdomain.domain.tld"
        s := strings.Index(host, ".")
        _ = host[:s]
    }
}

func BenchmarkStrip2(b *testing.B) {
    for br := 0; br < b.N; br++ {
        host := "subdomain.domain.tld"
        strings.TrimSuffix(host, ".domain.tld")
    }
}
Store this code in somename_test.go and run go test -test.bench='.*'. For me this gives
the following output:
% go test -test.bench='.*'
testing: warning: no tests to run
PASS
BenchmarkStrip1    100000000    12.9 ns/op
BenchmarkStrip2    100000000    16.1 ns/op
ok 21614966 2.935s
The benchmark utility will attempt to do a certain number of runs until a meaningful time is measured, which is reflected in the output by the number 100000000: the code was run 100,000,000 times, and each operation in the loop took 12.9 ns and 16.1 ns respectively.
So you can conclude that the code in BenchmarkStrip1 performed better.
Regardless of the outcome, it is often better to profile your program to see where the
real bottleneck is instead of wasting your time with micro benchmarks like these.
I would also not recommend writing your own benchmarking code, as there are factors you might not consider, such as the garbage collector and running your samples long enough.
