Golang profiling with pprof: how to get hit counts, not durations?

How do I get hit counts, like this:
(pprof) top
Total: 2525 samples
298 11.8% 11.8% 345 13.7% runtime.mapaccess1_fast64
268 10.6% 22.4% 2124 84.1% main.FindLoops
rather than durations, like this:
(pprof) top
2220ms of 3080ms total (72.08%)
Dropped 72 nodes (cum <= 15.40ms)
Showing top 10 nodes out of 111 (cum >= 60ms)
flat flat% sum% cum cum%
1340ms 43.51% 43.51% 1410ms 45.78% runtime.cgocall_errno
Environment: I'm using Go 1.4 and added the code below.
f, err := os.Create("innercpu.pprof")
if err != nil {
    fmt.Println("Error: ", err)
}
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

You can use go tool pprof -callgrind -output callgrind.out innercpu.pprof to generate callgrind data from your collected profile, which you can then visualise with qcachegrind/kcachegrind. It will display call counts.
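For example, the full round trip might look like this (assuming kcachegrind or qcachegrind is installed):

go tool pprof -callgrind -output callgrind.out innercpu.pprof
kcachegrind callgrind.out    # or: qcachegrind callgrind.out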

Related

Why use a channel in this function?

I was studying a blog post about when to use goroutines, and I saw the example pasted below (lines 61 to 65). But I don't get the purpose of using a channel here.
It seems that the author is iterating over the channel to retrieve the messages inside the goroutines.
But why not use a string slice directly?
58 func findConcurrent(goroutines int, topic string, docs []string) int {
59     var found int64
60
61     ch := make(chan string, len(docs))
62     for _, doc := range docs {
63         ch <- doc
64     }
65     close(ch)
66
67     var wg sync.WaitGroup
68     wg.Add(goroutines)
69
70     for g := 0; g < goroutines; g++ {
71         go func() {
72             var lFound int64
73             for doc := range ch {
74                 items, err := read(doc)
75                 if err != nil {
76                     continue
77                 }
78                 for _, item := range items {
79                     if strings.Contains(item.Description, topic) {
80                         lFound++
81                     }
82                 }
83             }
84             atomic.AddInt64(&found, lFound)
85             wg.Done()
86         }()
87     }
88
89     wg.Wait()
90
91     return int(found)
92 }
This code provides an example of a way of distributing work (finding strings within documents) among multiple goroutines. Basically, the code starts goroutines and feeds them documents to search via a channel.
But why not use a string slice directly?
It would be possible to use a string slice and a variable (let's call it count) to track which item in the slice you are up to. You would have code like this (a little long-winded, to demonstrate a point):
for {
    if count >= len(docarray) {
        break
    }
    doc := docarray[count]
    count++
    // Process the document
}
However, you would hit synchronisation issues. For example, what happens if two goroutines (running on different processor cores) reach if count >= len(docarray) at the same time? Without something to prevent this, they might both end up processing the same item in the slice (and potentially skipping the next element, because they both run count++).
Synchronising goroutines is complex, and the resulting issues can be very hard to debug. Using channels hides a lot of this complexity from you and makes it more likely that your code will work as expected (it does not solve all issues; note the use of atomic.AddInt64(&found, lFound) in the example code to prevent another potential problem that would result from multiple goroutines writing to a variable at the same time).
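If you did want to distribute work from a plain slice without a channel, the index itself would have to be claimed atomically. Here is a minimal sketch of that approach (not from the original post; findWithAtomicIndex and process are hypothetical names):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// findWithAtomicIndex distributes docs across goroutines by atomically
// claiming indexes, avoiding the read/increment race described above.
func findWithAtomicIndex(goroutines int, docs []string, process func(string)) {
    var next int64 // next index to claim, shared by all workers

    var wg sync.WaitGroup
    wg.Add(goroutines)
    for g := 0; g < goroutines; g++ {
        go func() {
            defer wg.Done()
            for {
                // Atomically claim the next index so no two
                // goroutines ever process the same document.
                i := atomic.AddInt64(&next, 1) - 1
                if i >= int64(len(docs)) {
                    return
                }
                process(docs[i])
            }
        }()
    }
    wg.Wait()
}

func main() {
    docs := []string{"doc1", "doc2", "doc3", "doc4"}
    findWithAtomicIndex(2, docs, func(doc string) { fmt.Println("processing", doc) })
}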
The author seems to just be using a contrived example to illustrate how channels work. Perhaps it would be desirable for him to come up with a more realistic example. But he does say:
Note: There are several ways and options you can take when writing a concurrent version of add. Don’t get hung up on my particular implementation at this time. If you have a more readable version that performs the same or better I would love for you to share it.
So it seems clear he wasn't trying to write the best code for the job, just something to illustrate his point.
He is using a buffered channel, so I don't think the channel is doing any special work here; a normal string slice would do the same.

runtime._ExternalCode CPU usage is too high, up to 80%

I wrote a TCP handler in Go that handles about 300 connections per second. The program had no problems when it was first released to production, but after running for about 10 days I saw CPU usage climb to 100%. I used go tool pprof to get CPU usage information:
File: gateway-w
Type: cpu
Time: Nov 7, 2018 at 5:38pm (CST)
Duration: 30.14s, Total samples = 30.13s ( 100%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 27.42s, 91.01% of 30.13s total
Dropped 95 nodes (cum <= 0.15s)
Showing top 10 nodes out of 28
flat flat% sum% cum cum%
24.69s 81.94% 81.94% 24.69s 81.94% runtime._ExternalCode /usr/local/go/src/runtime/proc.go
0.57s 1.89% 83.84% 0.57s 1.89% runtime.lock /usr/local/go/src/runtime/lock_futex.go
0.56s 1.86% 85.70% 0.56s 1.86% runtime.unlock /usr/local/go/src/runtime/lock_futex.go
0.26s 0.86% 86.56% 5.37s 17.82% gateway-w/connect/connect-tcp.tcpStartSession /go/src/gateway-w/connect/connect-tcp/tcp_framework.go
0.25s 0.83% 87.39% 1.67s 5.54% net.(*conn).Read /usr/local/go/src/net/net.go
0.24s 0.8% 88.18% 1.41s 4.68% net.(*netFD).Read /usr/local/go/src/net/fd_unix.go
0.23s 0.76% 88.95% 0.23s 0.76% runtime.nanotime /usr/local/go/src/runtime/sys_linux_amd64.s
0.22s 0.73% 89.68% 0.22s 0.73% internal/poll.(*fdMutex).incref /usr/local/go/src/internal/poll/fd_mutex.go
0.21s 0.7% 90.38% 0.21s 0.7% internal/poll.(*fdMutex).rwunlock /usr/local/go/src/internal/poll/fd_mutex.go
0.19s 0.63% 91.01% 0.19s 0.63% internal/poll.(*fdMutex).rwlock /usr/local/go/src/internal/poll/fd_mutex.go
My TCP handler code looks like this:
func tcpStartSession(conn net.Conn) {
    defer closeTcp(conn)
    var (
        last, n int
        err     error
        buff    []byte
    )
    last, n, err, buff = 0, 0, nil, make([]byte, MAX_PACKET_LEN)
    for {
        // set read timeout
        conn.SetReadDeadline(time.Now().Add(time.Duration(tcpTimeOutSec) * time.Second))
        n, err = conn.Read(buff[last:])
        if err != nil {
            log.Info("tcp read error maybe timeout , ", err)
            break
        }
        if n == 0 {
            log.Debug("empty packet, continue")
            continue
        }
        log.Debug("read bytes ", n)
        log.Info("get a raw package:", hex.EncodeToString(buff[:last+n]))
        last += n
        ...
        for {
            if last == 0 {
                break
            }
            ret, err := protoHandle.IsWhole(buff[:last])
            if err != nil {
                log.Warn("proto handle check iswhole error", err)
            }
            log.Debug("rest buffer len = %d\n", ret)
            if ret < 0 {
                // wait for more tcp fragments
                break
            }
            packetLen := last - ret
            packetBuf := make([]byte, packetLen)
            copy(packetBuf, buff[:packetLen])
            last = ret
            if last > 0 {
                copy(buff, buff[packetLen:packetLen+last])
            }
            ...
        }
    }
}
I can't understand what runtime._ExternalCode means; it is a function inside the Go runtime.
My Go version is go1.9.2 linux/amd64.
My program is running in Docker; my Docker version is 1.12.6.
I hope someone can help me. Thank you very much!
I tried upgrading Go to 1.10.3, and after that the program ran for more than half a year without problems. Recently the same problem occurred, even though I have not changed the program code. I suspect there is a problem with this line:
conn.SetReadDeadline(time.Now().Add(time.Duration(tcpTimeOutSec) * time.Second))
Need your help, thank you.
As you have confirmed, your program was not built with CGO_ENABLED=0, so the problem is probably in the cgo parts of the program: pprof can't profile inside sections of C libraries.
I believe there are also some other things that count as "external" code, like time.Now on some systems.
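One way to test that theory (a sketch; whether it is feasible depends on what your program actually uses cgo for) is to rebuild with cgo disabled and profile again:

# If the binary still builds and the runtime._ExternalCode samples
# disappear, the time was being spent in C code.
CGO_ENABLED=0 go build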

Go CPU profile is lacking function call information

I have been trying to dive in to Go (golang) performance analysis, based on articles like https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
However, in the actual profiled programs, the generated CPU profiles contain very little information: the go tool either reports that the profile is empty, or the profile has no information about any function calls. This happens on both OS X and Linux.
I generated a minimal example of this situation - I am gathering the profile in a similar manner and facing the same issues in actual programs, too.
Here's the source code for miniprofile/main.go:
package main

import (
    "fmt"
    "os"
    "runtime/pprof"
)

func do_something(prev string, limit int) {
    if len(prev) < limit {
        do_something(prev+"a", limit)
    }
}

func main() {
    f, err := os.Create("./prof")
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()
    do_something("", 100000)
}
I expect to see a CPU profile showing that almost all the time was spent in recursive calls to do_something.
Instead, this happens (the minimal app above is called miniprofile), which is not very useful:
$ go version
go version go1.6.2 darwin/amd64
$ go install .
$ miniprofile
$ go tool pprof --text prof
1.91s of 1.91s total ( 100%)
flat flat% sum% cum cum%
1.91s 100% 100% 1.91s 100%
Am I doing something in a horribly wrong way?
You're missing the binary argument to pprof:
go tool pprof --text miniprofile prof

Golang: benchmark Radix Tree Lookup

I've been trying to benchmark a radix tree implementation I wrote for the sake of practice with Go.
But I ran into the problem of how I should benchmark it. The code below shows two cases, or let's say two different ways, in which I would like to benchmark the LookUp func.
Case 1: use a single byte slice that exists in the tree, meaning the lookup will succeed, walking down through the child nodes, etc.
Case 2: use a func to pick a random slice from the existing data in the tree, meaning the lookup will succeed as well.
I know the time spent will depend on the tree depth... I think Case 2 is closer to real-world use, or is it?
QUESTION: Which case is more useful to benchmark?
Benchmark:
func BenchmarkLookUp(b *testing.B) {
    radix := New()
    insertData(radix, sampleData2)
    textToLookUp := randomBytes()
    for i := 0; i < b.N; i++ {
        radix.LookUp(textToLookUp) // Case 1
        //radix.LookUp(randomBytes()) // Case 2
    }
}

func randomBytes() []byte {
    strings := sampleData2()
    return []byte(strings[random(0, len(strings))])
}

func sampleData2() []string {
    return []string{
        "romane",
        "romanus",
        "romulus",
        ...
    }
}
Result Case 1:
PASS
BenchmarkLookUp-4 10000000 146 ns/op
ok github.com/falmar/goradix 2.068s
PASS
BenchmarkLookUp-4 10000000 149 ns/op
ok github.com/falmar/goradix 2.244s
Result Case 2:
PASS
BenchmarkLookUp-4 3000000 546 ns/op
ok github.com/falmar/goradix 3.094s
PASS
BenchmarkLookUp-4 3000000 538 ns/op
ok github.com/falmar/goradix 4.481s
Results when there is no match:
PASS
BenchmarkLookUp-4 10000000 194 ns/op
ok github.com/falmar/goradix 3.189s
PASS
BenchmarkLookUp-4 10000000 191 ns/op
ok github.com/falmar/goradix 3.243s
If your benchmark input is random, it will be very difficult to compare performance between different implementations from one run to the next.
Instead, statically implement a few different benchmark cases that stress different areas of your algorithm. The cases should represent different scenarios: the case where there are no matches (as you already have), the case where many items in the source data would be returned by a lookup, the case where there are many items and only one will be returned, and so on.
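Sub-benchmarks are a convenient way to express this. Here is a sketch reusing New, insertData, sampleData2 and LookUp from the question (the case names and inputs are made up):

func BenchmarkLookUpCases(b *testing.B) {
    radix := New()
    insertData(radix, sampleData2)

    cases := []struct {
        name  string
        input []byte
    }{
        {"match-shallow", []byte("romane")},
        {"match-deep", []byte("romanus")},
        {"no-match", []byte("xyzzy")},
    }
    for _, c := range cases {
        b.Run(c.name, func(b *testing.B) {
            // Each case gets its own timing and its own line in the
            // output, so runs are directly comparable.
            for i := 0; i < b.N; i++ {
                radix.LookUp(c.input)
            }
        })
    }
}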

How to profile benchmarks using the pprof tool?

I want to profile my benchmarks generated by go test -c, but go tool pprof needs a profile file, which is usually generated inside the main function like this:
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()
    }
    // ...
}
How can I create a profile file within my benchmarks ?
As described in https://pkg.go.dev/cmd/go#hdr-Testing_flags, you can specify the profile file using the -cpuprofile flag. For example:
go test -cpuprofile cpu.out
Use the -cpuprofile flag to go test, as documented at http://golang.org/cmd/go/#hdr-Description_of_testing_flags.
This post explains how to profile benchmarks with an example: Benchmark Profiling with pprof.
The following benchmark simulates some CPU work.
package main

import (
    "math/rand"
    "testing"
)

func BenchmarkRand(b *testing.B) {
    for n := 0; n < b.N; n++ {
        rand.Int63()
    }
}
To generate a CPU profile for the benchmark test, run:
go test -bench=BenchmarkRand -benchmem -cpuprofile profile.out
The -memprofile and -blockprofile flags can be used to generate memory allocation and blocking call profiles.
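For example (the output file names are arbitrary):

go test -bench=BenchmarkRand -memprofile mem.out -blockprofile block.out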
To analyze the profile use the Go tool:
go tool pprof profile.out
(pprof) top
Showing nodes accounting for 1.16s, 100% of 1.16s total
Showing top 10 nodes out of 22
flat flat% sum% cum cum%
0.41s 35.34% 35.34% 0.41s 35.34% sync.(*Mutex).Unlock
0.37s 31.90% 67.24% 0.37s 31.90% sync.(*Mutex).Lock
0.12s 10.34% 77.59% 1.03s 88.79% math/rand.(*lockedSource).Int63
0.08s 6.90% 84.48% 0.08s 6.90% math/rand.(*rngSource).Uint64 (inline)
0.06s 5.17% 89.66% 1.11s 95.69% math/rand.Int63
0.05s 4.31% 93.97% 0.13s 11.21% math/rand.(*rngSource).Int63
0.04s 3.45% 97.41% 1.15s 99.14% benchtest.BenchmarkRand
0.02s 1.72% 99.14% 1.05s 90.52% math/rand.(*Rand).Int63
0.01s 0.86% 100% 0.01s 0.86% runtime.futex
0 0% 100% 0.01s 0.86% runtime.allocm
The bottleneck in this case is the mutex, caused by the default source in math/rand being synchronized.
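As an aside (a sketch, not part of the original answer): one way to avoid that lock is to give the benchmark its own rand.Rand, whose source is not synchronized:

func BenchmarkRandLocal(b *testing.B) {
    // rand.New returns a *rand.Rand with its own unsynchronized source,
    // unlike the package-level functions, which share a locked source.
    r := rand.New(rand.NewSource(1))
    for n := 0; n < b.N; n++ {
        r.Int63()
    }
}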
Other profile presentations and output formats are also possible, e.g. tree. Type help for more options.
Note that any initialization code before the benchmark loop will also be profiled.
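To keep expensive setup out of the measured time, the testing package provides b.ResetTimer. A sketch (buildFixture and use are hypothetical):

func BenchmarkWithSetup(b *testing.B) {
    data := buildFixture() // expensive setup, excluded from timing below
    b.ResetTimer()         // resets the benchmark timer and allocation counters
    for n := 0; n < b.N; n++ {
        use(data)
    }
}

Note that b.ResetTimer fixes the reported ns/op, but CPU samples taken during setup can still show up in the profile itself.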
