Go HTTP Server Performance Issue

I am writing an event-collector HTTP server which will be under heavy load. In the HTTP handler I just deserialise the event and then run the actual processing outside of the HTTP request-response cycle, in a goroutine.
With this, I see that if I hit the server at 400 requests per second, the 99th-percentile latency is under 20ms. But as soon as I bump the request rate to 500 per second, latency shoots up to over 800ms.
Could anyone help me with some ideas on what the reason could be, so that I can explore further?
package controller

import (
    "encoding/json"
    "net/http"
    "time"

    "event-server/service"
)

// stats is assumed to be a package-level metrics client initialised elsewhere in this package.

func CollectEvent() http.Handler {
    handleFunc := func(w http.ResponseWriter, r *http.Request) {
        startTime := time.Now()
        stats.Incr("TotalHttpRequests", nil, 1)

        decoder := json.NewDecoder(r.Body)
        var event service.Event
        err := decoder.Decode(&event)
        if err != nil {
            http.Error(w, "Invalid json: "+err.Error(), http.StatusBadRequest)
            return
        }

        // Hand the event off to a goroutine so processing happens outside the request-response cycle.
        go service.Collect(&event)

        w.Write([]byte("Accepted"))
        stats.Timing("HttpResponseDuration", time.Since(startTime), nil, 1)
    }
    return http.HandlerFunc(handleFunc)
}
I ran a test with 1000 requests per second and profiled it. Following are the results.
(pprof) top20
Showing nodes accounting for 3.97s, 90.85% of 4.37s total
Dropped 89 nodes (cum <= 0.02s)
Showing top 20 nodes out of 162
flat flat% sum% cum cum%
0.72s 16.48% 16.48% 0.72s 16.48% runtime.mach_semaphore_signal
0.65s 14.87% 31.35% 0.66s 15.10% syscall.Syscall
0.54s 12.36% 43.71% 0.54s 12.36% runtime.usleep
0.46s 10.53% 54.23% 0.46s 10.53% runtime.cgocall
0.34s 7.78% 62.01% 0.34s 7.78% runtime.mach_semaphore_wait
0.33s 7.55% 69.57% 0.33s 7.55% runtime.kevent
0.30s 6.86% 76.43% 0.30s 6.86% syscall.RawSyscall
0.10s 2.29% 78.72% 0.10s 2.29% runtime.mach_semaphore_timedwait
0.07s 1.60% 80.32% 1.25s 28.60% net.dialSingle
0.06s 1.37% 81.69% 0.11s 2.52% runtime.notetsleep
0.06s 1.37% 83.07% 0.06s 1.37% runtime.scanobject
0.06s 1.37% 84.44% 0.06s 1.37% syscall.Syscall6
0.05s 1.14% 85.58% 0.05s 1.14% internal/poll.convertErr
0.05s 1.14% 86.73% 0.05s 1.14% runtime.memmove
0.05s 1.14% 87.87% 0.05s 1.14% runtime.step
0.04s 0.92% 88.79% 0.09s 2.06% runtime.mallocgc
0.03s 0.69% 89.47% 0.58s 13.27% net.(*netFD).connect
0.02s 0.46% 89.93% 0.40s 9.15% net.sysSocket
0.02s 0.46% 90.39% 0.03s 0.69% net/http.(*Transport).getIdleConn
0.02s 0.46% 90.85% 0.13s 2.97% runtime.gentraceback
(pprof) top --cum
Showing nodes accounting for 70ms, 1.60% of 4370ms total
Dropped 89 nodes (cum <= 21.85ms)
Showing top 10 nodes out of 162
flat flat% sum% cum cum%
0 0% 0% 1320ms 30.21% net/http.(*Transport).getConn.func4
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial-fm
0 0% 0% 1310ms 29.98% net.(*Dialer).DialContext
0 0% 0% 1310ms 29.98% net/http.(*Transport).dial
0 0% 0% 1310ms 29.98% net/http.(*Transport).dialConn
0 0% 0% 1250ms 28.60% net.dialSerial
70ms 1.60% 1.60% 1250ms 28.60% net.dialSingle
0 0% 1.60% 1170ms 26.77% net.dialTCP
0 0% 1.60% 1170ms 26.77% net.doDialTCP
(pprof)

The problem
I am using another goroutine because I don't want the processing to happen in the http request-response cycle.
That's a common fallacy (and hence a trap). The line of reasoning appears to be sound: you're trying to process requests "somewhere else" in an attempt to handle ingress HTTP requests as fast as possible.
The problem is that that "somewhere else" is still code which runs concurrently with the rest of your request-handling churn. Hence if that code runs slower than the rate of ingress requests, your processing goroutines will pile up, essentially draining one or more resources. Which one exactly depends on the actual processing:
- if it's CPU-bound, it will create natural contention for the CPU between all those GOMAXPROCS hardware threads of execution;
- if it's bound to network I/O, it will create load on the Go runtime scheduler, which has to divide the available execution quanta it has on its hands between all those goroutines wanting to be executed;
- if it's bound to disk I/O or other syscalls, you will have a proliferation of OS threads created; and so on and so on…
Essentially, you are queueing the work units converted from the
ingress HTTP requests, but queues do not fix overload.
They might be used to absorb short spikes of overload, but this only works when such spikes are "surrounded" by periods of load at least slightly below the maximum capacity provided by your system.
The fact you're queueing is not directly seen in your case, but it's
there, and it's exhibited by pressing your system past its natural
capacity—your "queue" starts to grow indefinitely.
Please read this classic essay carefully to understand why your approach is not going to work in a realistic production setting.
Pay close attention to those pictures of the kitchen sinks.
What to do about it?
Unfortunately, it's almost impossible to give you a simple solution, as we're not working with your code, in your setting, with your workload. Still, here are a couple of directions to explore.
On the broadest scale, try to see whether you have some easily discernible bottleneck in your system which you presently cannot see. For instance, if all those concurrent worker goroutines eventually talk to an RDBMS instance, its disk I/O may quite easily serialize all those goroutines, which will merely wait for their turn to have their data accepted.
The bottleneck may be simpler: say, in each worker goroutine you carelessly execute some long-running operation while holding a lock contended on by all those goroutines; this obviously serializes them all, as sketched below.
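To make that second case concrete, here is a hedged sketch; sendDownstream and updateSharedStats are hypothetical stand-ins for whatever your workers actually do:

var mu sync.Mutex

// Anti-pattern: the slow call runs inside the critical section,
// so every worker goroutine waits for every other worker's round-trip.
func processSerialized(e *service.Event) {
    mu.Lock()
    defer mu.Unlock()
    sendDownstream(e)    // slow network/disk call held under the lock
    updateSharedStats(e) // the only part that actually needs the lock
}

// Better: do the slow work outside the lock and only protect the shared state.
func processConcurrent(e *service.Event) {
    sendDownstream(e)

    mu.Lock()
    updateSharedStats(e)
    mu.Unlock()
}

Keeping only the shared-state update under the lock lets the slow I/O of different workers proceed in parallel.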
The next step would be to actually measure (I mean, by writing a benchmark) how much time it takes for a single worker to complete its unit of work. Then you need to measure how this number changes as you increase the concurrency factor. After collecting this data, you will be able to make educated projections about the realistic rate at which your system is able to handle requests.
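As an illustration, a minimal benchmark sketch against the service.Collect from the question; makeTestEvent is a hypothetical helper that builds a representative event, and the parallel variant lets you raise the concurrency factor via the -cpu flag:

func BenchmarkCollect(b *testing.B) {
    evt := makeTestEvent() // hypothetical helper returning a representative *service.Event
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        service.Collect(evt)
    }
}

func BenchmarkCollectParallel(b *testing.B) {
    evt := makeTestEvent()
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            service.Collect(evt)
        }
    })
}

// Run with, for example: go test -bench=Collect -cpu=1,2,4,8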
The next step is to think through your strategy for making your system fulfil those calculated expectations. Usually this means limiting the rate of ingress requests. There are different approaches to achieve this. Look at golang.org/x/time/rate for a time-based rate limiter, but it's possible to start with lower-tech approaches such as using a buffered channel as a counting semaphore. The requests which would overflow your capacity can be rejected (typically with HTTP status code 429, see this).
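For illustration, here is a minimal sketch of the counting-semaphore approach applied to the handler from the question; maxInFlight is an assumed constant you would derive from the measurements above, and the stats calls are omitted for brevity:

const maxInFlight = 1000 // assumed capacity; derive this from your benchmarks

var sem = make(chan struct{}, maxInFlight) // counting semaphore bounding in-flight processing

func CollectEvent() http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        select {
        case sem <- struct{}{}:
            // A slot is available; accept the request.
        default:
            // Over capacity: shed load instead of queueing indefinitely.
            http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
            return
        }

        var event service.Event
        if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
            <-sem // release the slot we just took
            http.Error(w, "Invalid json: "+err.Error(), http.StatusBadRequest)
            return
        }

        go func() {
            defer func() { <-sem }() // release the slot when processing finishes
            service.Collect(&event)
        }()

        w.Write([]byte("Accepted"))
    })
}

The buffered channel bounds the number of events being processed at any moment; anything beyond that is rejected immediately instead of piling up as goroutines.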
You might also consider queueing them briefly but I'd try this only
to serve as a cherry on a pie—that is, when you have the rest
sorted out completely.
The question of what to do with the rejected requests depends on your setting. Typically you try to "scale horizontally" by deploying more than one service to process your requests, and teach your clients to switch between the available services. (I'd stress that this means several independent services: if they all share some target sink which collects their data, they might be limited by the ultimate capacity of that sink, and adding more systems won't gain you anything.)
Let me repeat that the general problem has no magic solutions: if your complete system (with this HTTP service you're writing being merely its front-end, gateway part) is only able to handle N RPS of load, no amount of scattering go processRequest() is going to make it handle requests at a higher pace. The easy concurrency Go offers is not a silver bullet, it's a machine gun.

Related

golang GC profiling? runtime.mallocgc appears to be top one; then is moving to sync.Pool the solution?

I have an application written in Go that does message processing. It needs to pick up messages from the network (UDP) at a rate of 20K/second (potentially more), and each message can be up to the UDP packet's maximum length (64KB minus header size). The program needs to decode each incoming packet, encode it into another format, and send it to another network.
Right now, on a 24-core + 64GB RAM machine, it runs OK, but occasionally loses some packets. The programming pattern already follows the pipeline approach using multiple goroutines / channels, and it takes 10% of the whole machine's CPU load, so it has the potential to use more CPU% or RAM to handle all 20K/s messages without losing any. Then I started profiling; following this profiling guide, I found that runtime.mallocgc (the garbage collector runtime) appears at the top of the CPU profile. I suspect this GC could be the culprit: it hangs for a few milliseconds (or some microseconds) and loses some packets. Some best practices say switching to sync.Pool might help, but my switch to the pool seemingly results in more CPU contention and loses even more packets, more often.
(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
flat flat% sum% cum cum%
0 0% 0% 440.88s 96.09% runtime.goexit
1.91s 0.42% 1.75% 244.87s 53.37% sync.(*Pool).Get
64.42s 14.04% 15.79% 221.57s 48.29% sync.(*Pool).getSlow
94.29s 20.55% 36.56% 125.53s 27.36% sync.(*Mutex).Lock
1.62s 0.35% 36.91% 72.85s 15.88% runtime.systemstack
22.43s 4.89% 41.80% 60.81s 13.25% runtime.mallocgc
22.88s 4.99% 46.79% 51.75s 11.28% runtime.scanobject
1.78s 0.39% 47.17% 49.15s 10.71% runtime.newobject
26.72s 5.82% 53.00% 39.09s 8.52% sync.(*Mutex).Unlock
0.76s 0.17% 53.16% 33.74s 7.35% runtime.gcDrain
0 0% 53.16% 33.70s 7.35% runtime.gcBgMarkWorker
0 0% 53.16% 33.69s 7.34% runtime.gcBgMarkWorker.func2
The use of the pool is the standard pattern:
// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}

// get
rf := rfpool.Get().(*aPrivateStruct)

// put it back after done processing this message
rfpool.Put(rf)

Am I doing something wrong? Or are there other ways to tune the GC to use less CPU? The Go version is 1.8.
The listing shows a lot of lock contention happening in pool.getSlow (source: pool.go at golang.org):
(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
1.07mins 3.69mins (flat, cum) 48.29% of Total
. . 144: x = p.New()
. . 145: }
. . 146: return x
. . 147:}
. . 148:
80ms 80ms 149:func (p *Pool) getSlow() (x interface{}) {
. . 150: // See the comment in pin regarding ordering of the loads.
30ms 30ms 151: size := atomic.LoadUintptr(&p.localSize) // load-acquire
180ms 180ms 152: local := p.local // load-consume
. . 153: // Try to steal one element from other procs.
30ms 130ms 154: pid := runtime_procPin()
20ms 20ms 155: runtime_procUnpin()
730ms 730ms 156: for i := 0; i < int(size); i++ {
51.55s 51.55s 157: l := indexLocal(local, (pid+i+1)%int(size))
580ms 2.01mins 158: l.Lock()
10.65s 10.65s 159: last := len(l.shared) - 1
40ms 40ms 160: if last >= 0 {
. . 161: x = l.shared[last]
. . 162: l.shared = l.shared[:last]
. 10ms 163: l.Unlock()
. . 164: break
. . 165: }
490ms 37.59s 166: l.Unlock()
. . 167: }
40ms 40ms 168: return x
. . 169:}
. . 170:
. . 171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
. . 172:// Caller must call runtime_procUnpin() when done with the pool.
. . 173:func (p *Pool) pin() *poolLocal {
sync.Pool runs slowly under a highly concurrent load. Try to allocate all the structures once during startup and reuse them many times. For example, you may create several goroutines (workers) at start time, instead of running a new goroutine on each request (see the sketch below). I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
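For illustration, a minimal sketch of that fixed-worker approach, reusing the aPrivateStruct type from the question; decodeInto and forward are hypothetical stand-ins for the decode and encode-and-send steps:

const numWorkers = 8 // tune to your machine and load

func startWorkers(in <-chan []byte) {
    for i := 0; i < numWorkers; i++ {
        go func() {
            buf := new(aPrivateStruct) // allocated once per worker and reused for every message
            for msg := range in {
                decodeInto(buf, msg) // hypothetical decode step
                forward(buf)         // hypothetical encode-and-send step
            }
        }()
    }
}

Each worker owns its buffer for its whole lifetime, so no pool (and no lock) is touched on the per-message hot path.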
https://golang.org/pkg/sync/#Pool
a free list maintained as part of a short-lived object is not a
suitable use for a Pool, since the overhead does not amortize well in
that scenario. It is more efficient to have such objects implement
their own free list
You may try to set the GOGC value to greater than 100.
https://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables
Or, implement your own free list:
http://golang-jp.org/doc/effective_go.html#leaky_buffer
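A minimal sketch of that leaky-buffer free list, using a buffered channel and the aPrivateStruct type from the question; the capacity of 100 is arbitrary:

var freeList = make(chan *aPrivateStruct, 100)

func getBuffer() *aPrivateStruct {
    select {
    case b := <-freeList:
        return b // reuse an object from the free list
    default:
        return new(aPrivateStruct) // free list empty: allocate a fresh one
    }
}

func putBuffer(b *aPrivateStruct) {
    select {
    case freeList <- b: // return the object for reuse
    default:
        // Free list full: drop it on the floor and let the GC reclaim it.
    }
}

Unlike sync.Pool, this free list is never cleared by the GC, and the non-blocking selects mean it simply falls back to the allocator when empty and drops objects when full.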
Go 1.13 (Q4 2019) might change that: see CL 166961.
The original issue was issue 22950: "sync: avoid clearing the full Pool on every GC"
where I find it surprising that there are around 1000 allocations again every cycle. This seems to indicate that the Pool is clearing its entire contents upon every GC.
A peek at the implementation seems to indicate that this is so.
Result:
sync: smooth out Pool behavior over GC with a victim cache
Currently, every Pool is cleared completely at the start of each GC.
This is a problem for heavy users of Pool because it causes an allocation spike immediately after Pools are clear, which impacts both throughput and latency.
This CL fixes this by introducing a victim cache mechanism.
Instead of clearing Pools, the victim cache is dropped and the primary cache is
moved to the victim cache.
As a result, in steady-state, there are (roughly) no new allocations, but if Pool usage drops, objects will still be collected within two GCs (as opposed to one).
This victim cache approach also improves Pool's impact on GC dynamics.
The current approach causes all objects in Pools to be short lived. However, if an application is in steady state and is just going to repopulate its Pools, then these objects impact the live heap size as if they were long lived.
Since Pooled objects count as short lived when computing the GC trigger and goal, but act as long lived objects in the live heap, this causes GC to trigger too frequently.
If Pooled objects are a non-trivial portion of an application's heap, this
increases the CPU overhead of GC. The victim cache lets Pooled objects
affect the GC trigger and goal as long-lived objects.
This has no impact on Get/Put performance, but substantially reduces
the impact to the Pool user when a GC happens.
The PoolExpensiveNew benchmark demonstrates this through a substantial reduction in the rate at which the "New" function is called.

How to get a function-duration breakdown in go (profiling)

Update (Jan 24, 2019):
This question was asked 4 years ago about Go 1.4 (and is still getting views). Profiling with pprof has changed dramatically since then.
Original Question:
I'm trying to profile a Go Martini-based server I wrote. I want to profile a single request and get a complete breakdown of the functions with their runtime durations.
I tried playing around with both runtime/pprof and net/http/pprof, but the output looks like this:
Total: 3 samples
1 33.3% 33.3% 1 33.3% ExternalCode
1 33.3% 66.7% 1 33.3% runtime.futex
1 33.3% 100.0% 2 66.7% syscall.Syscall
The web view is not very helpful either.
We regularly profile another program, and the output seems to be what I need:
20ms of 20ms total ( 100%)
flat flat% sum% cum cum%
10ms 50.00% 50.00% 10ms 50.00% runtime.duffcopy
10ms 50.00% 100% 10ms 50.00% runtime.fastrand1
0 0% 100% 20ms 100% main.func·004
0 0% 100% 20ms 100% main.pruneAlerts
0 0% 100% 20ms 100% runtime.memclr
I can't tell where the difference is coming from.
pprof is a timer-based sampling profiler, originally from the gperftools suite. Russ Cox later ported the pprof tools to Go: http://research.swtch.com/pprof.
This timer-based profiler works by using the system profiling timer and recording statistics whenever it receives SIGPROF. In Go, this is currently set to a constant 100Hz. From pprof.go:
// The runtime routines allow a variable profiling rate,
// but in practice operating systems cannot trigger signals
// at more than about 500 Hz, and our processing of the
// signal is not cheap (mostly getting the stack trace).
// 100 Hz is a reasonable choice: it is frequent enough to
// produce useful data, rare enough not to bog down the
// system, and a nice round number to make it easy to
// convert sample counts to seconds. Instead of requiring
// each client to specify the frequency, we hard code it.
const hz = 100
You can set this frequency by calling runtime.SetCPUProfileRate and writing the profile output yourself, and Gperftools allows you to set this frequency with CPUPROFILE_FREQUENCY, but in practice it's not that useful.
In order to sample a program, it needs to be doing what you're trying to measure at all times. Sampling the idle runtime isn't showing anything useful. What you usually do is run the code you want in a benchmark, or in a hot loop, using as much CPU time as possible. After accumulating enough samples, there should be a sufficient number across all functions to show you proportionally how much time is spent in each function.
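For example, a minimal sketch using the main.pruneAlerts function that shows up in the profile above, assuming for illustration that it takes no arguments; the benchmark keeps it running in a hot loop so the 100 Hz sampler can accumulate enough samples:

func BenchmarkPruneAlerts(b *testing.B) {
    for i := 0; i < b.N; i++ {
        pruneAlerts() // keep the code under test busy so the sampler sees it
    }
}

// Collect and inspect a CPU profile of just this hot loop:
//   go test -bench=PruneAlerts -cpuprofile=cpu.out
//   go tool pprof cpu.out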
See also:
http://golang.org/pkg/runtime/pprof/
http://golang.org/pkg/net/http/pprof/
http://blog.golang.org/profiling-go-programs
https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs

Netty TrafficCounter

I am currently using io.netty.handler.traffic.ChannelTrafficShapingHandler & io.netty.handler.traffic.TrafficCounter to measure performance across a Netty client and server. I consistently see a discrepancy between the value of Current Write on the server and Current Read on the client. How can I account for this difference, considering that the Write/Read KB/s are close to matching all the time?
2014-10-28 16:57:50,099 [Timer-4] INFO PerfLogging 130 - Netty Traffic stats TrafficShaping with Write Limit: 0 Read Limit: 0 and Counter: Monitor ChannelTC431885482 Current Speed Read: 3049 KB/s, Write: 0 KB/s Current Read: 90847 KB Current Write: 0 KB
2014-10-28 16:57:42,230 [ServerStreamingLogging] DEBUG c.f.s.r.l.ServerStreamingLogger:115 - Traffic Statistics WKS226-39843-MTY6NDU6NTAvMDAwMDAw TrafficShaping with Write Limit: 0 Read Limit: 0 and Counter: Monitor ChannelTC385810078 Current Speed Read: 0 KB/s, Write: 3049 KB/s Current Read: 0 KB Current Write: 66837 KB
Is there some sort of compression between client and server?
I can see that my client-side value is approximately 3049 * 30 = 91470 KB, where 30 is the number of seconds over which the cumulative figure is calculated.
Scott is right; there is a fix in progress that also takes this into consideration.
Some explanation:
- read is the actual real read bandwidth and read byte count (since the system is not the origin of what it receives);
- for write events, the system is the source of them and manages them, so there are two kinds of writes (and will be, in the next fix):
  - proposed writes, which are not yet sent but which, before the fix, are taken into account in the bandwidth (lastWriteThroughput) and in the current write counter (currentWrittenBytes);
  - real writes, when they are effectively pushed to the wire.
Currently the issue is that currentWrittenBytes can be higher than the real writes, since they are mostly scheduled in the future, so they depend on the write speed of the handler which is the source of the write events.
After the fix, we will be more precise about what is "proposed/scheduled" and what is really "sent":
- proposed writes are taken into consideration in lastWriteThroughput and currentWrittenBytes;
- real write operations are taken into consideration in realWriteThroughput and realWrittenBytes when the writes occur on the wire (at least in the pipeline).
Now there is a second element: if you set the checkInterval to 30s, this implies the following:
- the bandwidth (the global average, and therefore the traffic control) is computed over those 30s (read or write);
- every 30s the "small" counters (currentXxxx) are reset to 0, while the cumulative counters are not. If you use the cumulative counters, you should see that the bytes received/sent are almost the same, while every 30s the "small" counters are reset to 0.
The smaller the checkInterval, the more accurate the bandwidth figures, but it should not be too small, to prevent too-frequent resets and too much thread activity on bandwidth computations. In general, a default of 1s is quite efficient.
The difference you see could be, for instance, because the sender's 30s event is not "synchronized" with the receiver's 30s event (and need not be). So, according to your numbers: when the receiver (read) resets its counters on its 30s event, the writer will reset its own counters 8s later (24 010 KB).

Application not running at full speed?

I have the following scenario:
- Machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2).
- Machine 2: an Oracle DB.
As a performance metric I usually look at the number of processed messages per unit of time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as though they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So for me, none of these values seem to be at any limit.
PS: for testing, of course, the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.

profiling and RtlUserThreadStart

I am profiling my multi-threaded application with the software "Sleepy". My threads are created through the Windows API using a thread pool.
While profiling, it shows a very high amount of time spent in the following functions:
- RtlUserThreadStart (time spent: 46% exclusive, 100% inclusive)
- ZwWaitForMultipleObjects (23% exclusive, 23% inclusive)
- NtWaitForMultipleObjects (15.4% exclusive, 15.4% inclusive)
- NtDelayExecution (7.7% exclusive, 7.7% inclusive)
- TpWaitForWork (5.8% exclusive, 5.8% inclusive)
- my actual computations (about 2% inclusive)
However, I am not sure whether these thread-handling functions are in fact the result of my computations or just some "wasted time" (is RtlUserThreadStart exactly the entry point of my callback functions, in which case the multi-threading would be fine?).
In short, do these data show that my multi-threading is useless or not?
Thanks!
