Can't get golang pprof working - go

I have tried to profile some Go applications, but I couldn't get it working. I followed these two tutorials:
http://blog.golang.org/profiling-go-programs
http://saml.rilspace.org/profiling-and-creating-call-graphs-for-go-programs-with-go-tool-pprof
Both say that after adding a few lines of code to the application, you have to run your app. I did that and received the following message on the screen:
2015/06/16 12:04:00 profile: cpu profiling enabled,
/var/folders/kg/4fxym1sn0bx02zl_2sdbmrhr9wjvqt/T/profile680799962/cpu.pprof
So I understand that the profiling is being executed and sending info to the file.
But when I check the file size, for any program that I test, it is always 64 bytes.
When I try to open the cpu.pprof file with pprof and execute the "top10" command, I see that there is nothing in the file:
("./fact" is my app)
go tool pprof ./fact /var/folders/kg/4fxym1sn0bx02zl_2sdbmrhr9wjvqt/T/profile680799962/cpu.pprof
(pprof) top10
0 of 0 total (0%)
      flat  flat%   sum%        cum   cum%
So it is as if nothing is happening while I am profiling.
I have tested it on Mac (this example) and on Ubuntu, with three different programs.
Do you know what I am doing wrong?
The example program is very simple; this is the code (a very simple factorial program that I took from the internet):
package main

import "fmt"
import "github.com/davecheney/profile"

func fact(n int) int {
	if n == 0 {
		return 1
	}
	return n * fact(n-1)
}

func main() {
	defer profile.Start(profile.CPUProfile).Stop()
	fmt.Println(fact(30))
}
Thanks,
Fer

As inf mentioned already, your code executes too fast. The reason is that pprof works by repeatedly halting your program during its execution, looking at which function is running at that moment in time and writing that down (together with the whole function call stack). Pprof samples with a rate of 100 samples per second. This is hardcoded in runtime/pprof/pprof.go as you can easily check (see https://golang.org/src/runtime/pprof/pprof.go line 575 and the comment above it):
func StartCPUProfile(w io.Writer) error {
	// The runtime routines allow a variable profiling rate,
	// but in practice operating systems cannot trigger signals
	// at more than about 500 Hz, and our processing of the
	// signal is not cheap (mostly getting the stack trace).
	// 100 Hz is a reasonable choice: it is frequent enough to
	// produce useful data, rare enough not to bog down the
	// system, and a nice round number to make it easy to
	// convert sample counts to seconds. Instead of requiring
	// each client to specify the frequency, we hard code it.
	const hz = 100

	// Avoid queueing behind StopCPUProfile.
	// Could use TryLock instead if we had it.
	if cpu.profiling {
		return fmt.Errorf("cpu profiling already in use")
	}
	cpu.Lock()
	defer cpu.Unlock()
	if cpu.done == nil {
		cpu.done = make(chan bool)
	}
	// Double-check.
	if cpu.profiling {
		return fmt.Errorf("cpu profiling already in use")
	}
	cpu.profiling = true
	runtime.SetCPUProfileRate(hz)
	go profileWriter(w)
	return nil
}
The longer your program runs, the more samples can be taken, and the more probable it becomes that even short-running functions will be sampled. If your program finishes before the first sample is taken, then the generated cpu.pprof will be empty.
As you can see from the code above, the sampling rate is set with
runtime.SetCPUProfileRate(..)
If you call runtime.SetCPUProfileRate() with another value before you call StartCPUProfile(), you can override the sampling rate. You will see a warning during the execution of your program telling you "runtime: cannot set cpu profile rate until previous profile has finished.", which you can ignore. It occurs because pprof.go calls SetCPUProfileRate() again; since you have already set the value, the one from pprof is ignored.
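For illustration, a minimal sketch of overriding the rate this way (the 500 Hz value is an arbitrary choice for this example):

package main

import (
	"fmt"
	"runtime"

	"github.com/davecheney/profile"
)

func main() {
	// Raise the sampling rate before StartCPUProfile runs; pprof's own
	// SetCPUProfileRate(100) call will then print the warning quoted
	// above and be ignored, keeping our 500 Hz rate.
	runtime.SetCPUProfileRate(500)
	defer profile.Start(profile.CPUProfile).Stop()

	// ... the code you want to profile ...
	fmt.Println("profiling at 500 Hz")
}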
Also, Dave Cheney has released a new version of his profiling tool, you can find it here: https://github.com/pkg/profile . There, you can, among other changes, specify the path where the cpu.pprof is written to:
defer profile.Start(profile.CPUProfile, profile.ProfilePath(".")).Stop()
You can read about it here: http://dave.cheney.net/2014/10/22/simple-profiling-package-moved-updated
By the way, your fact() function will quickly overflow, even if you take int64 as parameter and return value. 30! is roughly 2*10^32 and an int64 stores values only up to 2^63-1 which is roughly 9*10^18.
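As an aside, if exact values are wanted, math/big avoids the overflow; a quick sketch (not part of the original question):

package main

import (
	"fmt"
	"math/big"
)

// factBig computes n! exactly with arbitrary-precision integers.
func factBig(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

func main() {
	fmt.Println(factBig(30)) // 265252859812191058636308480000000
}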

The problem is that your function is running too fast and pprof can't sample it. Try adding a loop around the fact call and sum the result to artificially prolong the program.
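For example, a sketch of that change (the iteration count is arbitrary; pick one that makes the program run for a few seconds):

package main

import (
	"fmt"

	"github.com/davecheney/profile"
)

func fact(n int) int {
	if n == 0 {
		return 1
	}
	return n * fact(n-1)
}

func main() {
	defer profile.Start(profile.CPUProfile).Stop()
	// Repeat the work so the program runs long enough for the 100 Hz
	// sampler to collect a meaningful number of samples.
	sum := 0
	for i := 0; i < 10000000; i++ {
		sum += fact(30)
	}
	fmt.Println(sum)
}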

I struggled with empty pprof files until I realized that I was following outdated blog articles.
The upstream docs are good: https://pkg.go.dev/runtime/pprof
Write a test which you want to profile, then:
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
this creates cpu.prof and mem.prof.
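For example, a minimal benchmark for a function like the fact above could look like this (the file and function names are illustrative):

// fact_test.go
package main

import "testing"

func BenchmarkFact(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fact(30)
	}
}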
You can analyze them like this:
go tool pprof cpu.prof
This gives you an interactive command line. The "top" and "web" commands are often used.
Same for memory:
go tool pprof mem.prof

Go performance penalty in high number of calls to append

I'm writing an emulator in Go, and for debugging purposes I'm logging the CPU's state at every emulator cycle, to generate a log file later.
There's something I'm not doing properly, because while the logger is enabled, performance drops and makes the emulator unusable.
The profiler clearly shows that the culprit resides in the logging routine (the logStep method):
The logStep method is very simple: it calls CreateState to snapshot the current CPU state in a struct, and then adds it to a slice (in the Log method).
I call this method at every emulated CPU cycle (around 30,000 times per second), and I suspect either the garbage collector is slowing my execution or I'm doing something wrong with this data structure.
I get that the profile graph is pointing me to runtime.growslice, caused by an append located in (*cpu6502Logger).Log, but I'm unable to find information on how to do this more efficiently.
Also, I'm scratching my head over why CreateState takes so long just to create a simple struct.
This is what CpuState looks like:
type CpuState struct {
	Registers          Cpu6502Registers
	CurrentInstruction Instruction
	RawOpcode          [4]byte
	EvaluatedAddress   Address
	CyclesSinceReset   uint32
}
This is how I create a CPU Snapshot:
func CreateState(cpu Cpu6502) CpuState {
	pc := cpu.Registers().Pc
	var rawOpcode [4]byte
	rawOpcode[0] = 0x00
	pc++
	instruction := cpu.instructions[rawOpcode[0]]
	for i := byte(0); i < (instruction.Size() - 1); i++ {
		rawOpcode[1+i] = cpu.memory.Read(pc + Address(i))
	}
	_, evaluatedAddress, _, _ := cpu.addressEvaluators[instruction.AddressMode()](pc)

	state := CpuState{
		*cpu.Registers(),
		instruction,
		rawOpcode,
		evaluatedAddress,
		cpu.cycle,
	}

	return state
}
And finally, here is how I add this snapshot to the collection (the Log method in the profile graph). I've also added how I initialize logger.snapshots:
func createCPULogger(outputPath string) cpu6502Logger {
	return cpu6502Logger{
		outputPath: outputPath,
		snapshots:  make([]CpuState, 0, 10024),
	}
}

func (logger *cpu6502Logger) Log(state CpuState) {
	logger.snapshots = append(logger.snapshots, state)
}
Why is it slow?
Maintaining one gigantic slice that holds all the data is very costly, mainly because it is constantly extended. Each time an append exceeds the capacity, the whole backing array is copied to a bigger memory section to allow expansion. As the slice grows, each reallocation gets slower and slower. And you told us that you emulate thousands of CPU states per second.
Solution
The best way to deal with this is to allocate a fixed buffer of some length. We know that eventually we will run out of space, and when that happens we have two options. First, you can write all data from the buffer to a file, then truncate the buffer and start filling it again (then write again). The other option is to save filled buffers in a slice and allocate a new one; choose whichever fits your machine (a slow disk or small RAM is not good for the second solution). A sketch of the first option follows below.
Why does this help?
I think this also helps the emulator itself. There will be performance spikes when flushing the buffer, but most of the time performance will be at its maximum. Allocating big chunks of memory is slow, since the allocator is less likely to find a fitting section on the first try. The garbage collector is also very unhappy with frequent allocations. By allocating one buffer and filling it, we do a single big allocation (but not too big) and store data in sections; sections we have already saved can stay where they are. You could also say that in this case we are managing memory ourselves more than the GC does (no garbage memory is produced).
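A minimal sketch of the first option, reusing the question's CpuState type (the flush-to-file details are assumptions, not the asker's code):

import (
	"fmt"
	"os"
)

type cpu6502Logger struct {
	out       *os.File
	snapshots []CpuState // fixed-capacity buffer, allocated once
}

func (logger *cpu6502Logger) Log(state CpuState) {
	// When the buffer is full, write it out before appending, so
	// append never has to grow the backing array.
	if len(logger.snapshots) == cap(logger.snapshots) {
		logger.flush()
	}
	logger.snapshots = append(logger.snapshots, state)
}

// flush writes the buffered states to disk and empties the buffer while
// keeping the same backing array, so no reallocation ever happens.
func (logger *cpu6502Logger) flush() {
	for _, s := range logger.snapshots {
		fmt.Fprintf(logger.out, "%+v\n", s)
	}
	logger.snapshots = logger.snapshots[:0]
}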

How to demonstrate memory visibility problems in Go?

I'm giving a presentation about the Go Memory Model. The memory model says that without a happens-before relationship between a write in one goroutine, and a read in another goroutine, there is no guarantee that the reader will observe the change.
To have a bigger impact on the audience, instead of just telling them that bad things can happen if you don't synchronize, I'd like to show them.
When I run the below code on my machine (2017 MacBook Pro with 3.5GHz dual-core Intel Core i7), it exits successfully.
Is there anything I can do to demonstrate the memory visibility issues?
For example, are there any specific changes along the following lines that I could make to demonstrate the issue:
use different compiler settings
use an older version of Go
run on a different operating system
run on different hardware (such as ARM or a machine with multiple NUMA nodes).
For example in Java the flags -server and -client affect the optimizations the JVM takes and lead to visibility issues occurring.
I'm aware that the answer may be no, and that the spec may have been written to give future maintainers more flexibility with optimization. I'm aware I can make the code never exit by setting GOMAXPROCS=1 but that doesn't demonstrate visibility issues.
package main

var a string
var done bool

func setup() {
	a = "hello, world"
	done = true
}

func main() {
	go setup()
	for !done {
	}
	print(a)
}

Is it good idea to generate random string until success with "crypto/rand"?

Is it a good idea to generate a secure random hex string until the process succeeds?
All examples I've come across show that if rand.Read returns an error, we should panic, os.Exit(1), or return an empty string and the error.
I need my program to continue to function in case of such errors and wait until a random string is generated. Is it a good idea to loop until the string is generated, and are there any pitfalls with that?
import (
	"crypto/rand"
	"encoding/hex"
)

func RandomHex() string {
	var buf [16]byte
	for {
		_, err := rand.Read(buf[:])
		if err == nil {
			break
		}
	}
	return hex.EncodeToString(buf[:])
}
No. It may always return an error in certain contexts.
Example: playground: don't use /dev/urandom in crypto/rand
Imagine that a machine does not have the source that crypto/rand gets its data from, or that the program runs in a context that doesn't have access to that source. In that case you might consider having the program return that error in a meaningful way rather than spin.
More explicitly, if you are serious in your use of crypto/rand, then consider writing RandomHex such that it is exceptionally clear to the caller that it is meant for security contexts (possibly rename it), and return the error from RandomHex. The calling function needs to handle that error and let the user know that something is very wrong. For example, in a REST API, I'd expect that error to surface to the request handler, which would fail and return a 500 at that point and log a high-severity error.
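For illustration, a sketch of that shape (the function, handler, and route names are assumptions, not from the original answer):

package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// RandomHexToken makes the security use explicit and surfaces the error.
func RandomHexToken() (string, error) {
	var buf [16]byte
	if _, err := rand.Read(buf[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf[:]), nil
}

func tokenHandler(w http.ResponseWriter, r *http.Request) {
	token, err := RandomHexToken()
	if err != nil {
		// Something is very wrong: log at high severity and fail the request.
		log.Printf("SEVERE: random source unavailable: %v", err)
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	w.Write([]byte(token))
}

func main() {
	http.HandleFunc("/token", tokenHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}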
Is it a good idea to loop until the string is generated?
That depends. Probably yes.
Any pitfalls with that?
You discard the random bytes read on error, and you do so in a tight loop. Depending on the OS, this may drain your entropy source faster than it can be refilled.
Instead of an unbounded infinite loop, break after n rounds and give up. Graceful degradation or stopping is best: if your program is stuck in an endless loop, it is not "continuing" either.
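A minimal sketch of that bounded version (the retry count of 5 is arbitrary):

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
)

// RandomHex gives up after a few failed reads instead of spinning forever.
func RandomHex() (string, error) {
	var buf [16]byte
	for i := 0; i < 5; i++ {
		if _, err := rand.Read(buf[:]); err == nil {
			return hex.EncodeToString(buf[:]), nil
		}
	}
	return "", errors.New("random source unavailable, giving up")
}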

Number of specific go routines running

In Go's runtime lib we have NumGoroutine() to return the total number of goroutines running at that time. I was wondering if there is a simple way of getting the number of goroutines running a specific function?
Currently it tells me I have 1000, or whatever, goroutines in total, but I would like to know that I have 500 goroutines running func Foo. Possible? Simple? Shouldn't bother with it?
I'm afraid you have to count the goroutines on your own if you are interested in those numbers. The cheapest way to achieve your goal would be to use the sync/atomic package directly.
import "sync/atomic"

var counter int64

func example() {
	atomic.AddInt64(&counter, 1)
	defer atomic.AddInt64(&counter, -1)
	// ...
}
Use atomic.LoadInt64(&counter) whenever you want to read the current value of the counter.
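For example, a hypothetical Foo instrumented this way can be counted like so (the sleeps are only there to stand in for real work and to let the goroutines start):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var fooCount int64

func Foo() {
	atomic.AddInt64(&fooCount, 1)
	defer atomic.AddInt64(&fooCount, -1)
	time.Sleep(time.Second) // stand-in for real work
}

func main() {
	for i := 0; i < 500; i++ {
		go Foo()
	}
	time.Sleep(10 * time.Millisecond)
	fmt.Println("goroutines running Foo:", atomic.LoadInt64(&fooCount))
}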
And it's not uncommon to have a couple of such counters in your program, so that you can monitor it easily. For example, take a look at the CacheStats struct of the recently published groupcache source.

How to locate idle time (and network IO time, etc.) in XPerf?

Let's say I have a contrived program:
#include <Windows.h>

void useless_function()
{
    Sleep(5000);
}

void useful_function()
{
    // ... do some work
    useless_function();
    // ... do some more work
}

int main()
{
    useful_function();
    return 0;
}
Objective: I want the profiler to tell me that useful_function() is needlessly calling useless_function(), which waits for no obvious reason. Under XPerf this doesn't show up in any of the graphs I have, because the call to WaitForMultipleObjects() seems to be accounted to Idle.exe instead of my own program.
And here's the xperf command line that I currently run:
xperf -on Latency -stackwalk Profile
Any ideas?
(This is not restricted to wait functions. The above might have been solved by placing breakpoints at NtWaitForMultipleObjects. Ideally, there would be a way to see the stack samples that take up a lot of wall-clock time, as opposed to only CPU time.)
I think what you are looking for is the Wait Analysis with Ready Thread functionality in XPerf. It captures every context switch and gives you the call stack of the thread once it wakes up from a sleep (or an otherwise blocked operation). In your case, you would see the stack just after the call to Sleep(5000), as well as the time spent sleeping.
The functionality is a bit obscure to use. But it is fortunately well described here:
Use Xperf's Wait Analysis for Application-Performance Troubleshooting
Wait Analysis is the way to do this. You should:
Record the CSWITCH provider, in order to get all context switches
Record call stacks on context switches by adding +CSWITCH to your -stackwalk argument
Probably record call stacks on the ready thread, to get more information on who readied you (i.e., who released the mutex, CS, or semaphore, and where), by adding +READYTHREAD to your -stackwalk; a combined command line is sketched below
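For example, extending the question's command line (the exact flag spelling is an assumption; check xperf -help stackwalk on your system):
xperf -on Latency -stackwalk Profile+CSwitch+ReadyThread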
Then you use CPU Usage (Precise) in WPA (or xperfview, but that's ancient) to look at the context switches and find where your TimeSinceLast is high on a thread that shouldn't be going idle. You'll typically want the columns in CPU Usage (Precise) in this sort of order:
NewProcess (your process being switched in)
NewThreadId
NewThreadStack
ReadyingProcess (who made your thread ready to run)
ReadyingThreadId (optional)
ReadyThreadStack (optional, requires +ReadyThread on -stackwalk)
Orange bar
Count
TimeSinceLast (us) - sort by this column, usually
Whatever other columns you want
For details see these particular articles from my blog:
- https://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
- https://randomascii.wordpress.com/2012/06/19/wpaxperf-trace-analysis-reimagined/
This "profiler" will tell you: just randomly pause the program a few times and look at the stack. If "do some work" takes 5 seconds and "do some more work" takes 5 seconds, then 33% of the time the stack will look like this:
main: calling useful_function
useful_function: calling useless_function
useless_function: calling Sleep
So roughly 33% of your stack samples will show exactly that. Any line of code that's costing some fraction of wall-clock time will appear on roughly that fraction of samples.
On the rest of the samples you will see it doing the other things.
There are automated profilers that do the same thing in a more pretty way, such as Zoom and LTProf, although they don't actually show you the samples.
I looked at the xperf docs, trying to figure out if you could get stack samples on wall-clock time and get percentages at line-level resolution. It seems you have to be on Windows 7 or Vista. They only bother with functions, not lines, which matters if you have realistically big functions. I couldn't figure out how to get access to the individual samples, which I think is important for seeing why the program is spending its time.
