I have a simple program that prints hello world. To emit GC trace information I am running main.go directly with the following command:
$ GODEBUG=gctrace=1 go run main.go
gc 1 @0.011s 1%: 0.020+1.3+0.056 ms clock, 0.24+0.62/0.38/1.6+0.67 ms cpu, 4->4->0 MB, 5 MB goal, 12 P
Hello Playground
gc 2 @0.021s 0%: 0.004+0.55+0.006 ms clock, 0.052+0.12/0.36/0.49+0.079 ms cpu, 4->4->0 MB, 5 MB goal, 12 P
Question
Is there a way to build the binary so that it still prints the GC traces when run directly, without explicitly passing GODEBUG=gctrace=1 to the binary?
(OR)
How can I build a binary with GODEBUG=gctrace=1 baked in, without mentioning it explicitly at runtime?
$ go build -o main main.go // (builds the binary "main")
$ ./main // doesn't print gc trace information
Hello Playground
package main

import (
	"fmt"
)

func main() {
	fmt.Println("Hello, playground")
}
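One possible workaround, sketched under the assumption that nothing else in your environment sets GODEBUG: the runtime reads GODEBUG at process startup, so the binary can re-exec itself with gctrace enabled when the variable is absent. This is only an illustration, not a supported build flag.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// If GODEBUG is not set, re-run this same binary with gctrace enabled
	// so the runtime picks the setting up at startup.
	if os.Getenv("GODEBUG") == "" {
		cmd := exec.Command(os.Args[0], os.Args[1:]...)
		cmd.Env = append(os.Environ(), "GODEBUG=gctrace=1")
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr // gc traces are written to stderr
		if err := cmd.Run(); err != nil {
			os.Exit(1)
		}
		return
	}

	fmt.Println("Hello, playground")
}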
Related
When I run Python code in a JupyterLab (v3.4.3) IPython notebook (v8.4.0) and use the %%time cell magic, both CPU time and wall time are reported.
%%time
for i in range(10000000):
a = i*i
CPU times: user 758 ms, sys: 121 µs, total: 758 ms
Wall time: 757 ms
But when the same computation is performed using the %%sh magic to run a shell script, the cpu time results are nonsense.
%%time
%%sh
python -c "for i in range(10000000): a = i*i"
CPU times: user 6.14 ms, sys: 12.5 ms, total: 18.6 ms
Wall time: 920 ms
The docs for %time do say "Time execution of a Python statement or expression.", but this still surprised me because I had assumed that the shell script would run in a Python subprocess and thus could also be measured. So, what's going on here? Is this a bug, or just a known caveat of using %%sh?
I know I can use the shell builtin time or /usr/bin/time to get similar output, but this is a bit cumbersome for multiple lines of shell. Is there a better workaround?
I was reading this part of Parallel and Concurrent Programming in Haskell, and found the sequential version of the program to be far slower than the parallel one with one core:
Sequential:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 11.469s ( 11.529s elapsed)
Parallel:
$ cabal run fwsparse1 -O2 --ghc-options="-rtsopts -threaded" -- 1000 800 +RTS -s -N1
...
Total time 4.906s ( 4.988s elapsed)
According to the book, the sequential one should be slightly faster than the parallel one (if it runs on a single core).
The only difference in the two programs was this function:
Sequential:
update g k = Map.mapWithKey shortmap g
Parallel:
update g k = runPar $ do
  m <- Map.traverseWithKey (\i jmap -> spawn (return (shortmap i jmap))) g
  traverse get m
Since the spawn function uses deepseq, I initially thought it had something to do with strictness, but the use of force didn't change the performance of the sequential program.
Finally, I managed to get the sequential one to work faster:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 3.891s ( 3.901s elapsed)
by changing the update function to this:
update g k = runIdentity $ Map.traverseWithKey (\i jmap -> pure $ shortmap i jmap) g
Why does using traverseWithKey in the Identity monad speed up performance? I checked the IntMap source code but couldn't figure out the reason.
Is this a bug or the expected behaviour?
(Also, this is my first question on StackOverflow so please tell me if I'm doing anything wrong.)
EDIT:
As per Ismor's comment, I turned off optimization (using -O0) and the sequential program runs in 19.5 seconds with either traverseWithKey or mapWithKey, while the parallel one runs in 21.5 seconds.
Why is traverseWithKey optimized to be so much faster than mapWithKey?
Which function should I use in practice?
I am trying to profile my Go library to find out why it is so much slower than the same thing in C++.
I have a simple benchmark:
func BenchmarkFile(b *testing.B) {
	tmpFile, err := ioutil.TempFile("", TMP_FILE_PREFIX)
	if err != nil {
		b.Fatal(err)
	}
	fw, err := NewFile(tmpFile.Name())
	if err != nil {
		b.Fatal(err)
	}
	text := []byte("testing")
	for i := 0; i < b.N; i++ {
		_, err = fw.Write(text)
	}
	fw.Close()
}
NewFile returns my custom Writer, which encodes data into our binary representation, compresses it, and writes it to the file system.
Running go test -bench . -memprofile mem.out -cpuprofile cpu.out I get:
PASS
BenchmarkFile-16 2000000000 0.20 ns/op
ok .../writer/iowriter 9.074s
Then, analysing it:
# go tool pprof cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top10
930ms of 930ms total ( 100%)
flat flat% sum% cum cum%
930ms 100% 100% 930ms 100%
(pprof)
I even tried writing an example.go app that uses my writer and adding pprof.StartCPUProfile(f) as shown in http://blog.golang.org/profiling-go-programs, but with the same result.
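That scaffold looks roughly like this (a sketch only; the file name cpu.out and the busy loop are placeholders for my actual writer code):

package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile covering the lifetime of the program.
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// Exercise the code under test here.
	sum := 0
	for i := 0; i < 10000000; i++ {
		sum += i * i
	}
	fmt.Println(sum)
}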
What am I doing wrong, and how can I determine what the bottleneck of my library is?
Thank you in advance
OK, it's easy: I forgot to pass the binary to go tool pprof, so it has to be
# go tool pprof write cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top10
7.02s of 7.38s total (95.12%)
Dropped 14 nodes (cum <= 0.04s)
Showing top 10 nodes out of 32 (cum >= 0.19s)
flat flat% sum% cum cum%
6.55s 88.75% 88.75% 6.76s 91.60% syscall.Syscall
...
and when using benchmark tests, the test binary is created there as well, and using it gives the same result.
To expand on sejvolnd's answer:
pprof needs the binary that actually generated the cpu.out file as its first argument.
So you need to run the command as go tool pprof <go binary of your program> <generated profiling output file>
e.g. go tool pprof go_binary cpu.pprof
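For a benchmark run, go test with -cpuprofile also keeps the compiled test binary in the current directory (named after the package; iowriter.test below is just an example), so a typical session looks something like:

$ go test -bench . -cpuprofile cpu.out
$ go tool pprof iowriter.test cpu.out
(pprof) top10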
I've got my Go benchmark working with my API calls, but I'm not exactly sure what the output below means:
$ go test intapi -bench=. -benchmem -cover -v -cpuprofile=cpu.out
=== RUN TestAuthenticate
--- PASS: TestAuthenticate (0.00 seconds)
PASS
BenchmarkAuthenticate 20000 105010 ns/op 3199 B/op 49 allocs/op
coverage: 0.0% of statements
ok intapi 4.349s
How does it know how many calls it should make? I do have a loop with b.N as the loop size, but how does Go know how many iterations to run?
Also, I now have a CPU profile file. How can I view it?
From TFM:
The benchmark function must run the target code b.N times. The benchmark package will vary b.N until the benchmark function lasts long enough to be timed reliably.
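For the second part of the question, a typical way to inspect the profile is to pass the test binary together with the profile to go tool pprof (the binary name intapi.test below is illustrative, and the web command requires Graphviz):

$ go tool pprof intapi.test cpu.out
(pprof) top10
(pprof) web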
I'm used to debugging my code using ghci. Often, something like this happens (not so obvious, of course):
ghci> let f@(_:x) = 0:1:zipWith(+)f x
ghci> length f
Then, nothing happens for some time, and if I don't react fast enough, ghci has eaten maybe 2 GB of RAM, causing my system to freeze. If it's too late, the only way to solve this problem is [ALT] + [PRINT] + [K].
My question: Is there an easy way to limit the memory that ghci can consume to, let's say, 1 GB? If the limit is exceeded, the calculation should be aborted or ghci should be killed.
A platform-independent way to accomplish this is to supply the -M option to the Haskell runtime, like this:
ghci +RTS -M1m
see the GHC documentation’s page on how to control the RTS (runtime system) for details.
The ghci output now looks like:
>ghci +RTS -M10m
GHCi, version 6.12.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> let f@(_:x) = 0:1:zipWith(+)f x
Prelude> length f
Heap exhausted;
Current maximum heap size is 10485760 bytes (10 MB);
use `+RTS -M<size>' to increase it.
Running it under a shell with ulimit -m set is a fairly easy way. If you want to run with some limit on a regular basis, you can create a wrapper script that does ulimit before running ghci.
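A minimal wrapper sketch along those lines (the 1 GB value is just an example; note that on some systems ulimit -m is not enforced, so limiting the virtual address space with -v may be what actually takes effect):

#!/bin/sh
# Cap the memory available to ghci before starting it (values are in KB).
ulimit -v 1048576
exec ghci "$@"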