Ways of optimizing a CPU Intensive Golang WebApp - go

I have a toy web app which is very cpu intensive
func PerfServiceHandler(w http.ResponseWriter, req *http.Request)
{
start := time.Now()
w.Header().Set("Content-Type", "application/json")
x := 0
for i := 0; i < 200000000; i++ {
x = x + 1
x = x - 1
}
elapsed := time.Since(start)
w.Write([]byte(fmt.Sprintf("Time Elapsed %s", elapsed)))
}
func main()
{
http.HandleFunc("/perf", PerfServiceHandler)
http.ListenAndServe(":3000", nil)
}
The above function takes about 120 ms to execute. But when I do a load test this app with 500 concurrent users(siege -t30s -i -v -c500 http://localhost:3000/perf) the results I got
Average Resp Time per request 2.51 secs
Transaction Rate 160.57 transactions per second
Can someone answer my queries below:-
When I ran with 100, 200, 500 concurrent users I saw the no. of OS threads used by the above app got stuck to 35 from 7 when the app was just started. Increasing the no.of concurrent connection does not change this number. Even when 500 concurrent requests arrive at the server the number of OS threads were still stuck at 35 OS threads (The app was started with runtime.GOMAXPROCS(runtime.NumCPU())). When the test stopped the number was still 35.
Can someone explain me this behaviour?
Can the no. of OS threads be increased somehow (from OS or from GOlang)?
Will this improve the performance if no. of OS threads are increased?
Can someone suggest some other ways of optimizing this app?
Environment:-
Go - go1.4.1 linux/amd64
OS - Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
Processor - 2.6Ghz (Intel(R) Xeon(R) CPU E5-2640 v3 # 2.60GHz)
RAM - 64 GB
OS Parameters -
nproc - 32
cat /proc/sys/kernel/threads-max - 1031126
ulimit -u - 515563
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515563
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515563
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Multiple goroutines can correspond to a single os thread. The design is described here: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit, which references this paper: http://supertech.csail.mit.edu/papers/steal.pdf.
On to the questions:
Even when 500 concurrent requests arrive at the server the number of OS threads were still stuck at 35 OS threads [...] Can someone explain me this behaviour?
Since you set GOMAXPROCS to the # of CPUs go will only run that many goroutines at a time.
One thing that may be a little confusing is that goroutines aren't always running (sometimes they are "busy"). For example if you read a file, while the OS is doing that work the goroutine is busy and the scheduler will pick up another goroutine to run (assuming there is one). Once the file read is complete that goroutine goes back into the list of "runnable" goroutines.
The creation of OS level threads is handled by the scheduler and there are additional complexities around system-level calls. (Sometimes you need a real, dedicated thread. See: LockOSThread) But you shouldn't expect a ton of threads.
Can the no. of OS threads be increased somehow (from OS or from GOlang)?
I think using LockOSThread may result in the creation of new threads, but it won't matter:
Will this improve the performance if no. of OS threads are increased?
No. Your CPU is fundamentally limited in how many things it can do at once. Goroutines work because it turns out most operations are IO bound in some way, but if you are truly doing something CPU bound, throwing more threads at the problem won't help. In fact it will probably make it worse, since there is overhead involved in switching between threads.
In other words Go is making the right decision here.
Can someone suggest some other ways of optimizing this app?
for i := 0; i < 200000000; i++ {
x = x + 1
x = x - 1
}
I take it you wrote this code just to make the CPU do a lot of work? What does the actual code look like?
Your best bet will be finding a way to optimize that code so it needs less CPU time. If that's not possible (its already highly optimized), then you will need to add more computers / CPUs to the mix. Get a better computer, or more of them.
For multiple computers you can put a load balancer in front of all your machines and that should scale pretty easily.
You may also benefit by pulling this work off of the webserver and moving it to some backend system. Consider using a work queue.

Related

discrepancy between htop and golang readmemstats

My program loads a lot of data at start up and then calls debug.FreeOSMemory() so that any extra space is given back immediately.
loadDataIntoMem()
debug.FreeOSMemory()
after loading into memory , htop shows me the following for the process
VIRT RES SHR
11.6G 7629M 8000
But a call to runtime.ReadMemStats shows me the following
Alloc 5593336608 5.3G
BuckHashSys 1574016 1.6M
HeapAlloc 5593336610 5.3G
HeapIdle 2607980544 2.5G
HeapInuse 7062446080 6.6G
HeapReleased 2607980544 2.5G
HeapSys 9670426624 9.1G
MCacheInuse 9600 9.4K
MCacheSys 16384 16K
MSpanInuse 106776176 102M
MSpanSys 115785728 111M
OtherSys 25638523 25M
StackInuse 589824 576K
StackSys 589824 576K
Sys 10426738360 9.8G
TotalAlloc 50754542056 48G
Alloc is the amount obtained from system and not yet freed ( This is
resident memory right ?) But there is a big difference between the two.
I rely on HeapIdle to kill my program i.e if HeapIdle is more than 2 GB, restart - in this case it is 2.5, and isn't going down even after a while. Golang should use from heap idle when allocating more in the future, thus reducing heap idle right ?
If assumption 1 is wrong, which stat can accurately tell me what the RES value in htop is.
What can I do to reduce the value of HeapIdle ?
This was tried on go 1.4.2, 1.5.2 and 1.6.beta1
The effective memory consumption of your program will be Sys-HeapReleased. This still won't be exactly what the OS reports, because the OS can choose to allocate memory how it sees fit based on the requests of the program.
If your program runs for any appreciable amount of time, the excess memory will be offered back to the OS so there's no need to call debug.FreeOSMemory(). It's also not the job of the garbage collector to keep memory as low as possible; the goal is to use memory as efficiently as possible. This requires some overhead, and room for future allocations.
If you're having trouble with memory usage, it would be a lot more productive to profile your program and see why you're allocating more than expected, instead of killing your process based on incorrect assumptions about memory.

Need bash script that constantly uses high memory but low cpu?

I am running few experiments to see changes in system behavior under different memory and cpu loads. I was wondering is there a bash script which constantly uses high memory but low CPU?
For the purpose of simulating CPU/memory/IO load, most *NIX systems (Linux included) provide handy tool called stress.
The tool varies from OS to OS. On Linux, to take up 512MB of RAM with low CPU load:
stress --vm 1 --vm-bytes 512M --vm-hang 100
(The invocation means: start one memory thread (--vm 1), allocate/free 512MB of memory in every thread, sleep before freeing memory 100 seconds.)
This is silly, and can't be reasonably expected to provide data which will be useful in any real-world scenario. However, to generate at least the amount of memory consumption associated with a given power-of-two bytes:
build_string() {
local pow=$1
local dest=$2
s=' '
for (( i=0; i<pow; i++ )); do
s+="$s"
done
printf -v "$dest" %s "$s"
}
build_string 10 kilobyte # build a string of length 1024
echo "Kilobyte string consumes ${#kilobyte} bytes"
build_string 20 megabyte # build a string of length 1048576
echo "Megabyte string consumes ${#megabyte} bytes"
Note that transiently, during construction, at least 2x the requested space will be required (for the local); a version that didn't have this behavior would either be using namevars (depending on bash 4.3) or eval (depending on the author's willingness to do evil).

Tuning Netty on 32 Core / 10Gbit Hosts

Netty Server streams to a Netty client (point to point, 1 to 1):
Good
case: Server and Client are both 12 cores, 1Gbit NIC => going at the steady rate of 300K 200 byte messages per second
Not So Good
case: Server and Client are both 32 cores, 10Gbit NIC => (same code) starting at 130K/s degrading down to hundreds per second within minutes
Observations
Netperf shows that the "bad" environment is actually quite excellent ( can stream 600MB/s steady for a half an hour ).
It does not seem to be a client issue, since if I swap the client to a known good client (wrote it in C) that sets a max OS's SO_RCVBUF and does nothing but reads byte[]s and ignores them => the behavior is still the same.
Performance degradation starts before a high write watermark ( 200MB, but tried others ) is reached
Heap feels up quickly, and of course once reaches the max, GC kicks in locking the world, but that happens way after the "bad" symptoms surface. On a "good" environment heap stays steady somewhere at 1Gb, where it logically, given the configs, should be.
One thing that I noticed: most of the 32 cores are utilized while Netty Server streams, which I tried to limit by setting all the Boss/NioWorker threads to 1 (although there is a single channel anyway, but just in case):
val bootstrap = new ServerBootstrap(
new NioServerSocketChannelFactory (
Executors.newFixedThreadPool( 1 ),
Executors.newFixedThreadPool( 1 ), 1 ) )
// 1 thread max, memory limitation: 1GB by channel, 2GB global, 100ms of timeout for an inactive thread
val pipelineExecutor = new OrderedMemoryAwareThreadPoolExecutor(
1, 1 *1024 *1024 *1024, 2 *1024 *1024 *1024, 100, TimeUnit.MILLISECONDS,
Executors.defaultThreadFactory() )
bootstrap.setPipelineFactory(
new ChannelPipelineFactory {
def getPipeline = {
val pipeline = Channels.pipeline( serverHandlers.toArray : _* )
pipeline.addFirst( "pipelineExecutor", new ExecutionHandler( pipelineExecutor ) )
pipeline
}
} )
But that does not limit the number of cores used => still most of the cores are utilized. I understand that Netty tries to round robin worker tasks, but have a suspicion that 32 cores "at once" may be just too much for the NIC to handle.
Question(s)
Suggestions on the degrading performance?
How do I limit the number of cores used by Netty (without of course going the OIO route)?
side notes: would've loved to discuss it on Netty's mailing list, but it is closed. tried Netty's IRC, but it is dead
have you tried cpu/interrupt affinity?
the idea is to send io/irq interrupts into 1 or 2 cores only and prevent context switch in other cores.
give it a good. try vmstat and monitor ctx and inverse context switched before and after.
you may unpin the application from the interrupt handler core(s).

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and makes a number of network calls. To evaluate if its a good fit, I tried doing:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system, \n %P CPU, \n%M
max-mem footprint in KB, \n%t avg-mem footprint in KB, \n%K Average total
(data+stack+text) memory,\n%F major page faults, \n%I file system
inputs by the process, \n%O file system outputs by the process, \n%r
socket messages received, \n%s socket messages sent, \n%x status" yum
install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting - specially the 0 values. Even when I change the command to say ping google.com, the socket messages are 0. What's going on? Is there any alternative?
[And I'm confused if it should stay here or be posted in serverfault]
I think it's not working with Linux; I assume you're using Linux since you said "strace". The manual page says:
Bugs
Not all resources are measured by all versions of Unix,
so some of the values might be reported as zero. The present
selection was mostly inspired by the data provided by 4.2 or
4.3BSD.
I tried "wget" on an OSX system (which is BSD-ish) to check if it report socket statistics, and there at least socket works:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command is broken as specified in [1]. However the basic functionality of time (real , user and cpu time) are still intact.
[1] Maximum resident set size does not make sense

Linux Multi-Threaded Performance Enhancements for File open()

I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We’ve seemed to hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
loads a list of 1000 text files of ~8K size;
creates the threads (simply) with no attribute modifications;
each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the file list is exhausted in such a way that a single thread should open all 1000 files, 2 threads should theoretically open 500 files (not proven as of yet), etc.);
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds).
Running the app using a thread count > #CPUs results in run times that seem to level out at the approx same value as the time is takes to run with #CPUs threads (is this coincidental, or a systematic limit, or what?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count.
Running app on different servers shows similar results
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
With increasing the thread count, a LOW timing inflection point occurs at around 4-5 active threads, then increases again until the number of threads equals the CPU count, then levels off again;
Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).
Also, we noticed that Local opens take about .01 ms and sequential network opens are 100x slower at 1ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network roundtrip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads;
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call to “open()” work in a multi-threading process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait until its turn in round-robin way;
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8-CPUS, each with characteristics as shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 # 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in linux in a clean way, without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity, at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything, it's mostly just to get a handle on a file to do some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which was simply proved by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.
Thanks to 64 bit memory addressing, using mmap() to map an entire disk to memory is no problem at all. (on 32 bit platforms, you would need to map the parts of the disk you need using mmap64())

Resources