Update (Jan 24, 2019):
This question was asked 4 years ago about Go 1.4 (and is still getting views). Profiling with pprof has changed dramatically since then.
Original Question:
I'm trying to profile a Go Martini-based server I wrote. I want to profile a single request and get a complete breakdown of the functions with their runtime durations.
I tried playing around with both runtime/pprof and net/http/pprof but the output looks like this:
Total: 3 samples
1 33.3% 33.3% 1 33.3% ExternalCode
1 33.3% 66.7% 1 33.3% runtime.futex
1 33.3% 100.0% 2 66.7% syscall.Syscall
The web view is not very helpful either.
We regularly profile another program, and the output seems to be what I need:
20ms of 20ms total ( 100%)
flat flat% sum% cum cum%
10ms 50.00% 50.00% 10ms 50.00% runtime.duffcopy
10ms 50.00% 100% 10ms 50.00% runtime.fastrand1
0 0% 100% 20ms 100% main.func·004
0 0% 100% 20ms 100% main.pruneAlerts
0 0% 100% 20ms 100% runtime.memclr
I can't tell where the difference is coming from.
pprof is a timer-based sampling profiler, originally from the gperftools suite. Russ Cox later ported the pprof tools to Go: http://research.swtch.com/pprof.
This timer-based profiler works by using the system profiling timer and recording statistics whenever it receives SIGPROF. In Go, the rate is currently set to a constant 100 Hz. From pprof.go:
// The runtime routines allow a variable profiling rate,
// but in practice operating systems cannot trigger signals
// at more than about 500 Hz, and our processing of the
// signal is not cheap (mostly getting the stack trace).
// 100 Hz is a reasonable choice: it is frequent enough to
// produce useful data, rare enough not to bog down the
// system, and a nice round number to make it easy to
// convert sample counts to seconds. Instead of requiring
// each client to specify the frequency, we hard code it.
const hz = 100
You can set this frequency by calling runtime.SetCPUProfileRate and writing the profile output yourself, and Gperftools allows you to set this frequency with CPUPROFILE_FREQUENCY, but in practice it's not that useful.
In order to sample a program, it needs to be doing what you're trying to measure at all times. Sampling the idle runtime isn't showing anything useful. What you usually do is run the code you want in a benchmark, or in a hot loop, using as much CPU time as possible. After accumulating enough samples, there should be a sufficient number across all functions to show you proportionally how much time is spent in each function.
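For illustration, here is a minimal sketch of that pattern using runtime/pprof directly; doWork is a placeholder for whatever code you actually want to measure:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// doWork is a placeholder for the code you actually want to profile.
func doWork() {
	s := 0
	for i := 0; i < 1000000; i++ {
		s += i * i
	}
	_ = s
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Samples are collected at the fixed 100 Hz rate discussed above.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Keep the CPU busy with the interesting code so that enough
	// samples accumulate to show a meaningful breakdown.
	for i := 0; i < 10000; i++ {
		doWork()
	}
}

Inspect the result with go tool pprof against the binary and cpu.prof; a benchmark run with go test -bench . -cpuprofile cpu.prof produces the same kind of profile.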
See also:
http://golang.org/pkg/runtime/pprof/
http://golang.org/pkg/net/http/pprof/
http://blog.golang.org/profiling-go-programs
https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs
Related
We have been porting some of our CPU pipeline to Metal to speed up some of the slowest parts, with success. However, since we are only porting parts of it, we transfer data back and forth to the GPU, and I want to know how much time this actually takes.
Using the frame capture in Xcode, I'm told that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same Command Buffer).
Using Instruments I see some quite different numbers:
The entire set of operations takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area), and then the last 800 ms is spent in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I get a few more measurements: 46 ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I don't see how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: XCode 6 - OpenGL Driver Monitor Parameters
This makes me think that it could be the part that copies memory off the GPU, but then there would be no reason for getBytes to take that long. And I can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those?
Does the GPU code actually take only 149.5 ms to execute, or is Xcode lying to me? If it really does, where are the remaining 660 - 149.5 ms being spent?
I have a CUDA program with multiple kernels run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically the GPU portion. I'm doing the analysis using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on, using the nvprof tool.
But the profiler reports metric values separately for each kernel, while I want to compute them across all kernels to see the total GPU usage of the program.
Should I take the average, the largest value, or the total over all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88*10 + 76*20 + 50*30) / 60 = 65%
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that, rather than on kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
"overall" global load efficiency = (88*1000 + 76*2000 + 50*3000) / 6000 = 65%
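If you need to compute this for many kernels or metrics, the weighting is easy to script. Here is a small Go sketch of the same calculation; the kernel numbers are the hypothetical values from the tables above:

package main

import "fmt"

// kernelMetric pairs a weight (kernel duration in ms, or transaction count)
// with the metric value the profiler reported for that kernel (in percent).
type kernelMetric struct {
	weight float64
	value  float64
}

// weightedAverage returns the weighted mean of the metric values.
func weightedAverage(ks []kernelMetric) float64 {
	var sum, total float64
	for _, k := range ks {
		sum += k.value * k.weight
		total += k.weight
	}
	return sum / total
}

func main() {
	byDuration := []kernelMetric{{10, 88}, {20, 76}, {30, 50}}
	byTransactions := []kernelMetric{{1000, 88}, {2000, 76}, {3000, 50}}
	fmt.Printf("weighted by duration:     %.0f%%\n", weightedAverage(byDuration))     // 65%
	fmt.Printf("weighted by transactions: %.0f%%\n", weightedAverage(byTransactions)) // 65%
}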
I am writing an event-collector HTTP server which will be under heavy load. Hence, in the HTTP handler I just deserialise the event and then run the actual processing outside of the HTTP request-response cycle, in a goroutine.
With this, I see that if I hit the server at 400 requests per second, the 99th-percentile latency is under 20 ms. But as soon as I bump the request rate to 500 per second, latency shoots up to over 800 ms.
Could anyone give me some ideas about what the cause could be, so that I can investigate further?
package controller

import (
	"encoding/json"
	"net/http"
	"time"

	"event-server/service"
)

// CollectEvent decodes the incoming event and hands the actual processing
// off to a goroutine, so it happens outside the request-response cycle.
// stats is a package-level metrics client defined elsewhere in the package
// (not shown here).
func CollectEvent() http.Handler {
	handleFunc := func(w http.ResponseWriter, r *http.Request) {
		startTime := time.Now()
		stats.Incr("TotalHttpRequests", nil, 1)

		decoder := json.NewDecoder(r.Body)
		var event service.Event
		err := decoder.Decode(&event)
		if err != nil {
			http.Error(w, "Invalid json: "+err.Error(), http.StatusBadRequest)
			return
		}

		// Processing happens in a separate goroutine; the handler returns
		// immediately after decoding.
		go service.Collect(&event)
		w.Write([]byte("Accepted"))
		stats.Timing("HttpResponseDuration", time.Since(startTime), nil, 1)
	}
	return http.HandlerFunc(handleFunc)
}
I ran a test with 1000 requests per second and profiled it. Following are the results.
(pprof) top20
Showing nodes accounting for 3.97s, 90.85% of 4.37s total
Dropped 89 nodes (cum <= 0.02s)
Showing top 20 nodes out of 162
flat flat% sum% cum cum%
0.72s 16.48% 16.48% 0.72s 16.48% runtime.mach_semaphore_signal
0.65s 14.87% 31.35% 0.66s 15.10% syscall.Syscall
0.54s 12.36% 43.71% 0.54s 12.36% runtime.usleep
0.46s 10.53% 54.23% 0.46s 10.53% runtime.cgocall
0.34s 7.78% 62.01% 0.34s 7.78% runtime.mach_semaphore_wait
0.33s 7.55% 69.57% 0.33s 7.55% runtime.kevent
0.30s 6.86% 76.43% 0.30s 6.86% syscall.RawSyscall
0.10s 2.29% 78.72% 0.10s 2.29% runtime.mach_semaphore_timedwait
0.07s 1.60% 80.32% 1.25s 28.60% net.dialSingle
0.06s 1.37% 81.69% 0.11s 2.52% runtime.notetsleep
0.06s 1.37% 83.07% 0.06s 1.37% runtime.scanobject
0.06s 1.37% 84.44% 0.06s 1.37% syscall.Syscall6
0.05s 1.14% 85.58% 0.05s 1.14% internal/poll.convertErr
0.05s 1.14% 86.73% 0.05s 1.14% runtime.memmove
0.05s 1.14% 87.87% 0.05s 1.14% runtime.step
0.04s 0.92% 88.79% 0.09s 2.06% runtime.mallocgc
0.03s 0.69% 89.47% 0.58s 13.27% net.(*netFD).connect
0.02s 0.46% 89.93% 0.40s 9.15% net.sysSocket
0.02s 0.46% 90.39% 0.03s 0.69% net/http.(*Transport).getIdleConn
0.02s 0.46% 90.85% 0.13s 2.97% runtime.gentraceback
(pprof) top --cum
Showing nodes accounting for 70ms, 1.60% of 4370ms total
Dropped 89 nodes (cum <= 21.85ms)
Showing top 10 nodes out of 162
flat flat% sum% cum cum%
0 0% 0% 1320ms 30.21% net/http.(*Transport).getConn.func4
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial-fm
0 0% 0% 1310ms 29.98% net.(*Dialer).DialContext
0 0% 0% 1310ms 29.98% net/http.(*Transport).dial
0 0% 0% 1310ms 29.98% net/http.(*Transport).dialConn
0 0% 0% 1250ms 28.60% net.dialSerial
70ms 1.60% 1.60% 1250ms 28.60% net.dialSingle
0 0% 1.60% 1170ms 26.77% net.dialTCP
0 0% 1.60% 1170ms 26.77% net.doDialTCP
(pprof)
The problem
I am using another goroutine because I don't want the processing to happen in the HTTP request-response cycle.
That's a common fallacy (and hence a trap). The line of reasoning appears to be sound: you're trying to process requests "somewhere else" in an attempt to handle ingress HTTP requests as fast as possible.
The problem is that this "somewhere else" is still code which runs concurrently with the rest of your request handling. Hence, if that code runs slower than the rate of ingress requests, your processing goroutines will pile up, essentially draining one or more resources. Which one exactly depends on the actual processing:
if it's CPU-bound, it will create natural contention for the CPU between all those GOMAXPROCS hardware threads of execution;
if it's bound by network I/O, it will create load on the Go runtime scheduler, which has to divide the available execution quanta it has on its hands between all the goroutines wanting to run;
if it's bound by disk I/O or other syscalls, you will see a proliferation of OS threads, and so on and so on…
Essentially, you are queueing the work units converted from the ingress HTTP requests, but queues do not fix overload. They can absorb short spikes of overload, but this only works when such spikes are "surrounded" by periods of load at least slightly below the maximum capacity your system provides.
The fact that you're queueing is not directly visible in your case, but it is there, and it shows up once you press your system past its natural capacity: your "queue" starts to grow indefinitely.
Please read this classic essay carefully to understand why your approach is not going to work in a realistic production setting. Pay close attention to those pictures of the kitchen sinks.
What to do about it?
Unfortunately, it's almost impossible to give you a simple solution, as we're not working with your code in your setting with your workload. Still, here are a couple of directions to explore.
On the broadest scale, try to see whether you have some easily discernible bottleneck in your system which you presently cannot see. For instance, if all those concurrent worker goroutines eventually talk to an RDBMS instance, its disk I/O may quite easily serialize all those goroutines, which will merely wait for their turn to have their data accepted.
The bottleneck may be even simpler: say, in each worker goroutine you carelessly execute some long-running operation while holding a lock contended on by all the other goroutines; this obviously serializes them all (see the sketch below).
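A contrived Go sketch of that lock-contention case; slowCall, workerBad and workerBetter are hypothetical names used only to illustrate the shape of the problem:

package main

import (
	"sync"
	"time"
)

var (
	mu    sync.Mutex
	state = make(map[string]int)
)

// slowCall stands in for any long-running operation (an RPC, a DB query,
// heavy computation, ...).
func slowCall(key string) int {
	time.Sleep(100 * time.Millisecond)
	return len(key)
}

// workerBad serializes all goroutines: the lock is held for the whole
// duration of the slow call.
func workerBad(key string) {
	mu.Lock()
	defer mu.Unlock()
	state[key] = slowCall(key)
}

// workerBetter performs the slow work outside the critical section and
// holds the lock only for the cheap shared-state update.
func workerBetter(key string) {
	v := slowCall(key)
	mu.Lock()
	defer mu.Unlock()
	state[key] = v
}

func main() {
	var wg sync.WaitGroup
	for _, k := range []string{"a", "bb", "ccc"} {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			workerBetter(key) // swap in workerBad to see the serialization
		}(k)
	}
	wg.Wait()
}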
The next step would be to actually measure (I mean, by writing a benchmark) how much time it takes for a single worker to complete its unit of work. Then you need to measure how that number changes as you increase the concurrency factor. After collecting this data, you will be able to make educated projections about the realistic rate at which your system can handle requests.
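A minimal sketch of such a benchmark, to be placed in a _test.go file; the event type and processOneEvent are hypothetical stand-ins for whatever service.Collect actually does per event:

package service

import (
	"testing"
	"time"
)

// event and processOneEvent are placeholders for the real Event type and
// the real per-event processing.
type event struct{ payload string }

func processOneEvent(e *event) {
	time.Sleep(time.Millisecond) // pretend work
}

// BenchmarkProcessOneEvent measures one unit of work; varying -cpu shows
// how the cost changes with the concurrency factor:
//
//	go test -bench=ProcessOneEvent -cpu=1,2,4,8
func BenchmarkProcessOneEvent(b *testing.B) {
	ev := &event{payload: "representative payload"}
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			processOneEvent(ev)
		}
	})
}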
The next step is to think through your strategy for making your system fulfil those calculated expectations. Usually this means limiting the rate of ingress requests. There are different approaches to achieve this. Look at golang.org/x/time/rate for a time-based rate limiter, but it's possible to start with lower-tech approaches such as using a buffered channel as a counting semaphore (see the sketch below). The requests which would overflow your capacity may be rejected (typically with HTTP status code 429, see this). You might also consider queueing them briefly, but I'd treat that only as a cherry on the cake, that is, when you have the rest sorted out completely.
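One minimal sketch of the counting-semaphore idea applied to the handler above; CollectEventLimited is a hypothetical variant of CollectEvent, and maxInFlight is an arbitrary placeholder you would derive from your measurements:

package controller

import (
	"encoding/json"
	"net/http"

	"event-server/service"
)

// maxInFlight is a placeholder; derive the real value from your benchmarks.
const maxInFlight = 100

var sem = make(chan struct{}, maxInFlight)

// CollectEventLimited is a sketch of CollectEvent with a buffered channel
// used as a counting semaphore: when maxInFlight events are already being
// processed, further requests are rejected with 429 instead of spawning
// yet another goroutine.
func CollectEventLimited() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var event service.Event
		if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
			http.Error(w, "Invalid json: "+err.Error(), http.StatusBadRequest)
			return
		}
		select {
		case sem <- struct{}{}:
			go func() {
				defer func() { <-sem }()
				service.Collect(&event)
			}()
			w.Write([]byte("Accepted"))
		default:
			http.Error(w, "over capacity", http.StatusTooManyRequests)
		}
	})
}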
The question of what to do with rejected requests depends on your setting. Typically you try to "scale horizontally" by deploying more than one service to process your requests and teaching your clients to switch between the available services. (I'd stress that this means several independent services: if they all share some target sink which collects their data, they may be limited by the ultimate capacity of that sink, and adding more systems won't gain you anything.)
Let me repeat that the general problem has no magic solutions: if your complete system (with this HTTP service you're writing being merely its front-end, gateway part) is only able to handle N RPS of load, no amount of scattering go processRequest() is going to make it handle requests at a higher pace. The easy concurrency Go offers is not a silver bullet, it's a machine gun.
This is a simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. I'm working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.
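For what it's worth, here is a small Go sketch of the arithmetic behind both interpretations; the 10-processor and 200 MFLOPS figures come straight from the exercise:

package main

import "fmt"

func main() {
	const (
		singleProc = 200.0  // MFLOPS of one processor
		allProcs   = 2000.0 // MFLOPS of all 10 processors together
	)

	// Interpretation 1: the sequential and parallelizable parts overlap,
	// so all 10 processors stay busy the whole time.
	fmt.Printf("fully overlapped:         %.0f MFLOPS\n", allProcs)

	// Interpretation 2: the sequential part runs first (1 time unit on one
	// processor), then the parallelizable part (9 units of single-processor
	// work spread over 10 processors = 0.9 units).
	seqTime, parTime := 1.0, 0.9
	avg := (seqTime*singleProc + parTime*allProcs) / (seqTime + parTime)
	fmt.Printf("sequential then parallel: %.0f MFLOPS\n", avg) // ≈ 1053
}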
I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2).
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: neither of the two machines is working at full speed. If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they don't have enough to do.
What I would expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of the values seems to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically need to also measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.