Tuning Netty on 32 Core / 10Gbit Hosts - performance

A Netty server streams to a Netty client (point to point, 1 to 1):
Good case: server and client both have 12 cores and a 1Gbit NIC => a steady rate of 300K 200-byte messages per second
Not so good case: server and client both have 32 cores and a 10Gbit NIC => (same code) starts at 130K/s and degrades to hundreds per second within minutes
Observations
Netperf shows that the "bad" environment is actually quite excellent (it can stream a steady 600MB/s for half an hour).
It does not seem to be a client issue: if I swap in a known good client (written in C) that sets SO_RCVBUF to the OS maximum and does nothing but read byte[]s and discard them, the behavior is still the same.
Performance degradation starts before the high write watermark (200MB, but I tried others) is reached.
The heap fills up quickly, and of course once it reaches the max, GC kicks in and stops the world, but that happens well after the "bad" symptoms surface. In the "good" environment the heap stays steady at around 1GB, where it logically should be given the configs.
One thing that I noticed: most of the 32 cores are utilized while Netty Server streams, which I tried to limit by setting all the Boss/NioWorker threads to 1 (although there is a single channel anyway, but just in case):
val bootstrap = new ServerBootstrap(
  new NioServerSocketChannelFactory(
    Executors.newFixedThreadPool( 1 ),
    Executors.newFixedThreadPool( 1 ), 1 ) )

// 1 thread max, memory limitation: 1GB by channel, 2GB global, 100ms of timeout for an inactive thread
// (the L suffix keeps the 2GB product from overflowing Int)
val pipelineExecutor = new OrderedMemoryAwareThreadPoolExecutor(
  1, 1L * 1024 * 1024 * 1024, 2L * 1024 * 1024 * 1024, 100, TimeUnit.MILLISECONDS,
  Executors.defaultThreadFactory() )

bootstrap.setPipelineFactory(
  new ChannelPipelineFactory {
    def getPipeline = {
      val pipeline = Channels.pipeline( serverHandlers.toArray : _* )
      pipeline.addFirst( "pipelineExecutor", new ExecutionHandler( pipelineExecutor ) )
      pipeline
    }
  } )
But that does not limit the number of cores used => still most of the cores are utilized. I understand that Netty tries to round robin worker tasks, but have a suspicion that 32 cores "at once" may be just too much for the NIC to handle.
Question(s)
Suggestions on the degrading performance?
How do I limit the number of cores used by Netty (without of course going the OIO route)?
Side notes: I would've loved to discuss this on Netty's mailing list, but it is closed. I tried Netty's IRC, but it is dead.

Have you tried CPU/interrupt affinity?
The idea is to route I/O interrupts (IRQs) to only 1 or 2 cores and avoid context switches on the other cores.
Give it a go. Try vmstat, and monitor context switches and involuntary context switches before and after.
You may also want to pin the application away from the interrupt handler core(s).
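The application side of this can be tried out from a script on Linux. Below is a minimal sketch, not a definitive recipe: `cpus_for_app` and `context_switches` are hypothetical helper names, and the choice of cores 0 and 1 as the interrupt cores is an assumption for illustration. Steering the NIC's IRQs themselves is done through `/proc/irq/<n>/smp_affinity`, which needs root and the NIC's own IRQ numbers, so it is not shown here.

```python
import os

def cpus_for_app(allowed, irq_cpus):
    """Return the CPUs the application may use: everything except the IRQ cores."""
    remaining = set(allowed) - set(irq_cpus)
    # Never return an empty set: fall back to the original mask instead.
    return remaining or set(allowed)

def context_switches(pid="self"):
    """Read voluntary/involuntary context-switch counters from /proc (Linux only)."""
    counters = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key.endswith("ctxt_switches"):
                counters[key] = int(value)
    return counters

def main():
    if not hasattr(os, "sched_setaffinity"):
        print("CPU affinity calls are not available on this platform")
        return
    irq_cpus = {0, 1}  # assumption: NIC interrupts are steered to cores 0 and 1
    allowed = os.sched_getaffinity(0)
    # Pins the calling thread; threads and children started later inherit the mask.
    os.sched_setaffinity(0, cpus_for_app(allowed, irq_cpus))
    print("pinned to", sorted(os.sched_getaffinity(0)))
    print(context_switches())

if __name__ == "__main__":
    main()
```

For a JVM such as the Netty server, the same effect is usually obtained by launching it under `taskset -c` with the core list you want, then comparing the context-switch counters before and after.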

Related

fio: the larger -numjobs is, the smaller the IOPS. What is the reason?

fio -numjobs=8 -directory=/mnt -iodepth=64 -direct=1 -ioengine=libaio -sync=1 -rw=randread -bs=4k
FioTest: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
iops: (8 threads and iodepth=64)-> 356, 397, 399, 396, ...
but when -numjobs=1 and iodepth=64, the iops -> 15873
I feel a little confused. Why do the IOPS get smaller as -numjobs gets larger?
It's hard to make a general statement because the correct answer depends on a given setup.
For example, imagine I have a cheap spinning SATA disk whose sequential speed is fair but whose random access is poor. The more random I make the accesses, the worse things get (because of the latency involved in servicing each I/O; https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html suggests 3ms is the cost of having to seek). So 64 simultaneous random accesses are bad because the disk head is seeking to 64 different locations before the last I/O is serviced. If I now bump the number of jobs up to 8, that 64 * 8 = 512 means even MORE seeking. Worse, there are only so many simultaneous I/Os that can actually be serviced at any given time. So the disk's queue of in-flight I/Os can become completely full, other queues start backing up, latency in turn goes up again, and IOPS start tumbling. Also note this is compounded because you're preventing the disk from saying "It's in my cache, you can carry on": sync=1 forces the I/O to be on non-volatile media before it is marked as done.
This may not be what is happening in your case but is an example of a "what if" scenario.
I think you should add '--group_reporting' on your fio command.
group_reporting
If set, display per-group reports instead of per-job when numjobs is specified.
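To put rough numbers on the queueing argument, Little's law (I/Os in flight = IOPS x average latency) can be applied to the figures in the question. This is only a sketch under two assumptions: that fio actually keeps the full iodepth outstanding, and that the ~397 IOPS figure is per job (which is what you get without group_reporting), so the aggregate is 8 times that. `mean_latency_ms` is a hypothetical helper name.

```python
def mean_latency_ms(in_flight, iops):
    """Little's law: average time one I/O spends in the system, in milliseconds."""
    return in_flight / iops * 1000.0

# numjobs=1, iodepth=64: one job keeps 64 I/Os in flight.
single = mean_latency_ms(in_flight=64, iops=15873)

# numjobs=8, iodepth=64: assumption - ~397 IOPS is a per-job figure,
# so the total is 8 times that, with 8 * 64 = 512 I/Os in flight.
aggregate_iops = 8 * 397
multi = mean_latency_ms(in_flight=8 * 64, iops=aggregate_iops)

print(f"1 job : {single:.1f} ms per I/O at 15873 IOPS")
print(f"8 jobs: {multi:.1f} ms per I/O at {aggregate_iops} IOPS in total")
```

Under those assumptions the 8-job run has both lower total IOPS and a much higher per-I/O latency, which is consistent with queues backing up rather than with a reporting artifact alone.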

Python Multiprocessing slightly slower than Multithreading on Windows

I have been experimenting with my code to send "parallel" commands to multiple serial COM ports.
My multi-threading code consists of:
global q
q = Queue()
devices = [0, 1, 2, 3]
for i in devices:
    q.put(i)
cpus = cpu_count()  # detect number of cores
logging.debug("Creating %d threads" % cpus)
for i in range(cpus):
    t = Thread(name='DeviceThread_' + str(i), target=testFunc1)
    t.daemon = True
    t.start()
and multi-processing code consists of:
devices = [0, 1, 2, 3]
cpus=cpu_count() #detect number of cores
pool = Pool(cpus)
results = pool.map(multi_run_wrapper, devices)
I observe that the task of sending serial commands to 4 COM ports in "parallel" takes about 6 seconds, and multi-processing always takes an additional 0.5 to 1 second of total run time.
Any input on why there is a discrepancy on a Windows machine?
Well, for one, you're not comparing apples to apples. If you want equivalent code, use multiprocessing.dummy.Pool in your threaded case (which is the same as multiprocessing.Pool implemented in terms of threads, not processes), so you're at least using the same basic parallelization model with different internal implementations, not changing everything all at once.
Beyond that, launching the workers and communicating data to them has some overhead, more on Windows than on other systems since Windows can't fork to spawn new processes cheaply; it has to spawn a new Python instance and then copy over state via IPC to approximate forking.
Aside from that, you haven't provided enough information; your process and thread based worker functions aren't provided, and could cause significant differences in behavior. Nor have you provided information on how you're performing timing. Similarly, if each worker process needs to reinitialize the COM port communication library, that could involve non-trivial overhead.
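As a rough illustration of the like-for-like comparison, here is a sketch in which the same map call is timed through a pool factory. `fake_serial_command` is a hypothetical stand-in for the real COM port work (the 50 ms sleep is made up), and `timed_map` is a hypothetical helper that deliberately includes pool start-up in the measurement.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def fake_serial_command(device):
    """Hypothetical stand-in for sending a command to one COM port."""
    time.sleep(0.05)  # pretend the device takes 50 ms to answer
    return device, "ok"

def timed_map(pool_factory, func, items):
    """Run func over items in a pool; the timing includes pool start-up."""
    start = time.perf_counter()
    with pool_factory(len(items)) as pool:
        results = pool.map(func, items)
    return results, time.perf_counter() - start

if __name__ == "__main__":
    devices = [0, 1, 2, 3]
    results, elapsed = timed_map(ThreadPool, fake_serial_command, devices)
    print(f"threads: {elapsed:.3f} s -> {results}")
```

To time the process-based version, pass multiprocessing.Pool instead of ThreadPool to the same function. On Windows that call has to stay under the `if __name__ == "__main__":` guard, because each worker starts a fresh interpreter that re-imports the main module.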

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for a much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size, or an otherwise better method to use the CPU most effectively? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into details, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as multiprocessing.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.
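As a starting point for those measurements, the arithmetic above can be written down as a small helper. This is a sketch using the answer's pessimistic estimates (100 μs per IPC request, 1% overhead budget); `min_chunksize` is a hypothetical name, and treating the question's 0.355 ms per sudoku as the per-task time is an assumption.

```python
def min_chunksize(task_ns, ipc_ns=100_000, max_overhead_pct=1):
    """Smallest chunksize that keeps IPC cost under max_overhead_pct of chunk time."""
    # Target chunk duration: the 100 us IPC estimate at 1% overhead gives 10 ms.
    target_chunk_ns = ipc_ns * 100 // max_overhead_pct
    # Ceiling division in integers, so there are no floating-point surprises.
    return max(1, -(-target_chunk_ns // task_ns))

print(min_chunksize(task_ns=1_000))    # 1 us tasks, as in the example above: 10000
print(min_chunksize(task_ns=355_000))  # ~0.355 ms per task (assumed): 29
```

The result is a lower bound for keeping IPC overhead small, not an optimum; the end-of-run imbalance argues for staying close to it rather than going far above.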

Ways of optimizing a CPU Intensive Golang WebApp

I have a toy web app which is very cpu intensive
package main

import (
	"fmt"
	"net/http"
	"time"
)

func PerfServiceHandler(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	w.Header().Set("Content-Type", "application/json")
	// Busy loop that only burns CPU time
	x := 0
	for i := 0; i < 200000000; i++ {
		x = x + 1
		x = x - 1
	}
	elapsed := time.Since(start)
	w.Write([]byte(fmt.Sprintf("Time Elapsed %s", elapsed)))
}

func main() {
	http.HandleFunc("/perf", PerfServiceHandler)
	http.ListenAndServe(":3000", nil)
}
The above function takes about 120 ms to execute. But when I load test this app with 500 concurrent users (siege -t30s -i -v -c500 http://localhost:3000/perf), the results I get are:
Average Resp Time per request 2.51 secs
Transaction Rate 160.57 transactions per second
Can someone answer my queries below:-
When I ran with 100, 200, 500 concurrent users I saw the no. of OS threads used by the above app rise from 7 (when the app was just started) and get stuck at 35. Increasing the no. of concurrent connections does not change this number. Even when 500 concurrent requests arrive at the server the number of OS threads were still stuck at 35 OS threads (The app was started with runtime.GOMAXPROCS(runtime.NumCPU())). When the test stopped the number was still 35.
Can someone explain me this behaviour?
Can the no. of OS threads be increased somehow (from OS or from GOlang)?
Will this improve the performance if no. of OS threads are increased?
Can someone suggest some other ways of optimizing this app?
Environment:-
Go - go1.4.1 linux/amd64
OS - Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
Processor - 2.6GHz (Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz)
RAM - 64 GB
OS Parameters -
nproc - 32
cat /proc/sys/kernel/threads-max - 1031126
ulimit -u - 515563
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515563
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515563
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Multiple goroutines can correspond to a single os thread. The design is described here: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit, which references this paper: http://supertech.csail.mit.edu/papers/steal.pdf.
On to the questions:
Even when 500 concurrent requests arrive at the server the number of OS threads were still stuck at 35 OS threads [...] Can someone explain me this behaviour?
Since you set GOMAXPROCS to the number of CPUs, Go will only run that many goroutines at a time.
One thing that may be a little confusing is that goroutines aren't always running (sometimes they are "blocked"). For example, if you read a file, the goroutine is blocked while the OS does that work, and the scheduler will pick up another goroutine to run (assuming there is one). Once the file read is complete, that goroutine goes back into the list of "runnable" goroutines.
The creation of OS level threads is handled by the scheduler and there are additional complexities around system-level calls. (Sometimes you need a real, dedicated thread. See: LockOSThread) But you shouldn't expect a ton of threads.
Can the no. of OS threads be increased somehow (from OS or from GOlang)?
I think using LockOSThread may result in the creation of new threads, but it won't matter:
Will this improve the performance if no. of OS threads are increased?
No. Your CPU is fundamentally limited in how many things it can do at once. Goroutines work because it turns out most operations are IO bound in some way, but if you are truly doing something CPU bound, throwing more threads at the problem won't help. In fact it will probably make it worse, since there is overhead involved in switching between threads.
In other words Go is making the right decision here.
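A back-of-the-envelope calculation shows why more threads cannot help here. It is a sketch under the assumption that each request burns the full 120 ms on one CPU and that all 32 CPUs reported by nproc deliver full speed (with hyper-threading they probably do not, so the real ceiling is lower). `max_throughput` and `min_response_time` are hypothetical helper names.

```python
def max_throughput(cores, cpu_seconds_per_request):
    """Upper bound on requests/second when every request is purely CPU bound."""
    return cores / cpu_seconds_per_request

def min_response_time(concurrent_users, throughput):
    """Little's law with no think time: seconds each request spends in the system."""
    return concurrent_users / throughput

bound = max_throughput(cores=32, cpu_seconds_per_request=0.120)
print(f"upper bound: {bound:.1f} requests/s")
print(f"500 users at that bound: {min_response_time(500, bound):.2f} s each")
```

So even in the ideal case the box tops out at roughly 267 requests per second, and 500 simultaneous users would each wait close to 2 seconds. The measured 160.57 transactions per second and 2.51 s are in the same ballpark, which points at CPU capacity rather than the OS thread count.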
Can someone suggest some other ways of optimizing this app?
for i := 0; i < 200000000; i++ {
	x = x + 1
	x = x - 1
}
I take it you wrote this code just to make the CPU do a lot of work? What does the actual code look like?
Your best bet will be finding a way to optimize that code so it needs less CPU time. If that's not possible (it's already highly optimized), then you will need to add more computers / CPUs to the mix. Get a better computer, or more of them.
For multiple computers you can put a load balancer in front of all your machines and that should scale pretty easily.
You may also benefit by pulling this work off of the webserver and moving it to some backend system. Consider using a work queue.

Why Garbage Collect in web apps?

Consider building a web app on a platform where every request is handled by a User Level Thread (ULT) (green thread/erlang process/goroutine/... any lightweight thread). Assume every request is stateless, and resources like DB connections are obtained at startup of the app and shared between these threads. What is the need for garbage collection in these threads?
Generally such a thread is short running (a few milliseconds) and if well designed doesn't use more than a few (KB or MB) of memory. If garbage collection of the resources allocated in the thread is done at the exit of the thread and independent of the other threads, then there would be no GC pauses for even the 98th or 99th percentile of requests. All requests would be answered in predictable time.
What is the problem with such a model and why is it not being widely used?
Your assumption might not be true.
if well designed doesn't use more than a few (KB or MB) of memory
Imagine a function for counting words in a text file which is used in a web app. Some naive implementation could be,
def count_words(text):
    words = text.split()
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count
It allocates more memory than the text itself occupies.
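One way to check this is to measure the peak allocation with tracemalloc. The sketch below assumes CPython and uses a made-up input of 10,000 distinct short words; the exact numbers depend on the interpreter version, so none are quoted here.

```python
import sys
import tracemalloc

def count_words(text):
    """Naive word count: builds a full list of words, then a dict of counts."""
    words = text.split()
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count

# Hypothetical input: 10,000 distinct short words.
text = " ".join(f"w{i}" for i in range(10_000))

tracemalloc.start()
count_words(text)
_, peak = tracemalloc.get_traced_memory()  # peak bytes allocated while tracing
tracemalloc.stop()

print(f"text itself:      {sys.getsizeof(text)} bytes")
print(f"peak extra alloc: {peak} bytes")
```

The list returned by split() holds a separate string object for every word, and the dict adds its own table on top, so a request that handles a few MB of text can temporarily need several times that.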

Resources