Docker service poor network performance - performance

I've got a go service (http.ListenAndServe) that simply echoes 'hello world' (the most basic service in order to not introduce overhead to the benchmark). The question is, if I run the service with go run server.go and run a performance test (using wrk https://github.com/wg/wrk) on my local machine (Macbook pro), I got a performance of
Requests/sec: 59336.07
Transfer/sec: 8.66MiB
but if I just run the service in docker (docker run -p8080:8080 periket2000/goku) and run the performance test again, I got:
Requests/sec: 4743.77
Transfer/sec: 0.69MiB
Does it make sense? It's docker so poor in performance?
I've tested it with several services/stacks and got the same results.

The service you describe does extremely little, and from looking at its GitHub page it looks like wrk goes out of its way to avoid things like HTTP persistent connections, so your benchmark basically only tests how fast the combined system can set up and tear down TCP connections. Note that the data rate, even in the loopback case, is really low: there's a lot of handshaking and IP and TCP overhead and HTTP headers before you get to say "hello world" and hang up.
In the loopback case it's highly likely that the MacOS kernel can funnel data almost directly from one process to the other.
Process A |--> Local loopback ----> | Process B
In the Docker case, traffic needs to go over a virtual network interface, into a virtual machine, getting received by a Linux-kernel network interface, traversing a NAT layer into Docker space, and then gets received by the process; replies need to go through this stack in reverse. It's not surprising that this has more overhead, and the tiny packets involved make it worse (see the discussion of connection establishment in the Wikipedia page on TCP; there is waiting for each side to complete the SYN-ACK and FIN-RST sequences and those don't carry data IIRC). So, these results aren't surprising.
Process A |--> Virtual interface
Linux VM eth0
iptables NAT
Linux VM docker0
Container eth0 ----> | Process B
In absolute numbers, in systems I've worked with in the past, getting over 4,000 requests per second definitely gets into the "respectable performance" space. (The loopback-only 8 MB/s data rate doesn't look stellar, because of the nature of the workload.) It might be worth re-running the benchmark with a more realistic workload (mix of GETs and POSTs; non-trivial data responses; some computation or external I/O required to produce results). I wouldn't expect Docker for Mac to hit 100% of the performance of a purely-native no-external-network setup, but I would expect larger data-to-overhead ratios and things like HTTP persistent connections to help.

Related

How to increase the request per second on amazon EC2 T2.micro instance?

I recently lunched a Amazon EC2 instance, the T2.micro. After installed Wildfly 8.2.0Final, I try to do a load test of the web server. I tested the server to serve a static page of less than 500 byte size, and a dynamic page that write and read mysql. To my suprise, I got the similar result, both test get the result of around 1000 RPS. I monitored the system using top -d 1, the CPU hasn't reach the max, and there are free memory. I think either EC2 has some limitation on concurrent connections, or my setup needs improvement.
My setup is CentOS 7, WileFly/Jboss 8.2.0 Final, MariaDb 5.5. The test tool is jmeter in distributed mode or command line mode. Tests were performed on remote, on the same subnet, and on the localhost. All get the same result.
Can you please help identify where the bottleneck is. Are there any limitations on Amazon EC2 instance that could affect this? Thanks.
Yes, there are some limitations depending of the EC2 instance type and one of them is network performance.
Amazon doesn't publish the exact limitations of each type of instance, but in the Instance Types Matrix you can see that t2.micro has a low to moderate network performance. If you need better network performance, you can check on the AWS instance types page where it shows which instances have enhanced networking:
Enhanced Networking
Enhanced Networking enables you to get significantly higher packet per second (PPS) performance, lower network jitter and lower latencies. This feature uses a new network virtualization stack that provides higher I/O performance and lower CPU utilization compared to traditional implementations. In order to take advantage of Enhanced Networking, you should launch an HVM AMI in VPC, and install the appropriate driver. Enhanced Networking is currently supported in C4, C3, R3, I2, M4, and D2 instances. For instructions on how to enable Enhanced Networking on EC2 instances, see the Enhanced Networking on Linux and Enhanced Networking on Windows tutorials. To learn more about this feature, check out the Enhanced Networking FAQ section.
You have more information in these SO and SF questions:
Bandwidth limits for Amazon EC2
Does anyone know the bandwidth available for different EC2 Instances?
EC2 Instance Types's EXACT Network Performance?
You're right that 1000 RPS feels awfully low for Wildfly, given that the Undertow server powering it is one of the fastest in Java land and among the 10 fastest, period.
Starting points to optimize:
Make sure that you do not have request logging on (that could cause an I/O bottleneck), use the latest stable JVM, and it's probably worth using the most recent Wildfly version that your app works with.
With that done, you're almost certainly being bottlenecked by connection creation, not your AWS instance. This could be within JMeter, or within the Wildfly subsystem.
To eliminate JMeter as a culprit, try ApacheBenchmark ("ab") at the same concurrency level, and then try it with the -k option on (to allow connection reuse).
If the first ApacheBenchmark number is much higher than JMeter, the issue is the thread-based networking model that JMeter uses (Another load-testing tool, such as gatling or locust.io may be needed).
If the second number is much higher than the first, the bottleneck is proven to be connection creation. The may be solved by tuning the Undertow server settings.
As far as WildFly goes, I'd have to see the config.xml, but you may be able to improve performance by tweaking the Undertow subsystem settings. The defaults are usually solid, but you want a very low number of I/O threads (either 1, or the number of CPUs, no more).
I have seen a trivial Wildfly 10 application far exceed the performance you're seeing on a t2.micro instance.
Benchmark results, with Wildfly 10 + docker + Java 8:
Server setup (EC2 t2.micro running latest amazon linux, in US-east-1, different AZs)
sudo yum install docker
sudo service docker start
sudo docker run --rm -it -p 8080:8080 svanoort/jboss-demo-app:0.7-lomem
Client (another t2.micro, minimal load, different AZ):
ab -c 16 -k -n 1000 http://$SERVER_PRIVATE_IP:8080/rest/cached/500
16 concurrent connections with keep-alive, serving 500 bytes of cached randomly pre-generated data
Results over multiple runs:
430 requests per second (RPS), 1171 RPS, 1527 RPS, 1686 RPS, 1977 RPS, 2471 RPS, 3339 RPS, eventually peaking at ~6500 RPS after hundreds of thousands of requests.
Notice how that goes up over time? It's important to prewarm the server before benchmarking, to allow for enough handler threads to be created, and to allow for JIT compilation. 10,000 requests is a good starting point.
If I turn off connection keepalive? Peaks at about ~1450 RPS with concurrency 16. BUT WAIT! With a single thread (concurrency 1), it only gives ~340-350 RPS. Increasing concurrency beyond 16 does not give higher performance, it remains fairly stable (even up to 512 concurrent connections).
If I increase the request data size to 2000 bytes, by using http://$SERVER_PRIVATE_IP:8080/rest/cached/2000 then it still hits 1367 RPS, showing that almost all of the time is spent on connection handling.
With very large (300k) requests and connection keep-alive, I hit about 50 MB/s between hosts, but I've seen up to 90 MB/s in optimal situations.
Very impressive performance for JBoss/Wildfly there, I'd say. Note that higher concurrency may be needed if there is more latency between hosts, to allow for the impact of round-trip time on connection creation.

Simulate slow speed for TCP sockets in Windows

I'm building an application that uses TCP sockets to communicate. I want to test how it behaves under slow-speed conditions.
There are similar question on the site, but as I understand it, they deal with HTTP traffic, or are about Linux. My traffic is not HTTP, just ordinary TCP sockets, and the OS is Windows.
I tried using fiddler's setting for Modem Speed but it didn't work, it seems to work only for HTTP connections.
While it is true that you probably want to invest in an extensive set of unit tests, You can simulate various network conditions using VMWare Workstation:
You will have to install a virtual machine for testing, setup bridged networking (for the vm to access your real network) and upload your code to the vm.
After that you can start changing the settings and see how your application performs.
NetLimiter can also be used, but it has fewer options (in your case, packet loss is very interesting to test and is not available in netlimiter).
There is an excellent utility for Windows that can do throttling and much more:
https://jagt.github.io/clumsy/
I think you're taking the wrong approach here.
You can achieve everything that you need with some well designed unit tests. All of the things that a slow network link causes can be simulated in a unit test environment in controlled conditions.
Things that your code MUST handle to deal with "slow" links are just things that you should be dealing with anyway, including:
The correct handling of fragmented messages. All of your network reading code needs to correctly assume that each read will return between 1 byte and the size of your read buffer. You should never assume that you'll get complete 'messages' as TCP knows nothing of your concept of messages.
TCP flow control causing either your synchronous sends to fail with some form of 'try later' error or your async sends to succeed and potentially use an uncontrolled amount of resources (see here for more details). Note that this can happen even on 'fast' links if you are sending faster than the receiver is consuming.
Timeouts - again this isn't limited to "slow" links. All of your timeout handling code should be robust and tested. You may want to make sure that any read timeout is based on any read completing rather than reading a complete message in x time. You may be getting your data at a slow rate but whilst you're still getting data the link is alive.
Connection failure - again not something specific to "slow" links. You need to know how you deal with connections being reset at any time.
In summary nothing you can achieve by running your client and server on a simulated slow network cannot be achieved with a decent set of unit tests and everything that you would want to test on such a link is something that could affect any of your connections on any speed of link.

How slow are TCP sockets compared to named pipes on Windows for localhost IPC?

I am developing a TCP Proxy to be put in front of a TCP service that should handle between 500 and 1000 active connections from the wild Internet.
The proxy is running on the same machine as the service, and is mostly-transparent. The service is for the most part unaware of the proxy, the only exception being the notification of the real remote IP address of the clients.
This means that, for every inbound open TCP socket, there are two more sockets on the server: the secondth of the pair in the Proxy, and the one on the real service behind the proxy.
The send and recv window sizes on the two Proxy sockets are set to 1024 bytes.
What are the performance implications on this? How slow is this configuration? Should I put some effort on changing the service to use Named Pipes (or other IPC mechanism), or a localhost TCP socket is for the most part an efficient IPC?
The merge of the two apps is not an option. Right now we are stuck with the two process configuration.
EDIT: The reason for having two separate process on the same hardware is 100% economics. We have one server only, and we are not planning on getting more (no money).
The TCP service is a legacy software in Visual Basic 6 which grew beyond our expectations. The proxy is C++. We don't have the time, money nor manpower to rewrite and migrate the VB6 code to a modern programming environment.
The proxy is our attempt to mitigate a specific performance issue on the service, a DDoS attack we are getting from time to time.
The proxy is open source, and here is the project source code.
It will be the same (or at least not measurably different). Winsock is smart enough to know if it's talking to a socket on the same host and, in that case, it will short-circuit pretty much everything below IP and copy data directly buffer-to-buffer. In terms of named pipes vs. sockets, if you need to potentially be able to communicate to different machines ever in the future, choose sockets. If you know for a fact that you'll never need to do that, pick whichever one your developers are most familiar or most comfortable with.
For anyone that comes to read this later, I want to add some findings that answer the original question.
For a utility we are developing we have a networking class that can use named pipes, or TCP with the same calls.
Here is a typical loop back file transfer on our test system:
TCP/IP Transfer time: 2.5 Seconds
Named Pipes Transfer time: 3.1 Seconds
Now, if you go outside the machine and connect to a remote computer on your network the performance for named pipes is much worse:
TCP/IP Transfer time: 12 Seconds
Named Pipes Transfer time: 2.5 Minutes (Yes Minutes!)
I realize that this is just one system (Windows 7) But I think it is a good indicator of how slow named pipes can be...and it seems like TCP is the way to go.
I know this topic is very old, but it was still relevant for me, and maybe others will look at this in the future as well.
I implemented IPC between Excel (VBA) and another process on the same machine, both via a TCP connection as well as via Named Pipes.
In a quick performance test, I submitted a message than consisted of 26 bytes from client (Excel) to server (not Excel), and waited for the reply message from the other process (which consisted of 12 bytes in the example).
I executed this a ton of times in a loop and measured the average execution time.
With TCP on localhost (Windows 7, no fastpath), one "conversation" (request+reply) took around 300-350 microseconds. Especially sending data was quite slow (sending the 26 bytes took around 200microseconds via TCP).
With Named Pipes, one conversation took around 60 microseconds on average - so a LOT faster.
I'm not entirely sure why the difference was so large. The corporate environment I tested this in has a strict firewall, package inspections and what not, so I THINK this may have been caused as even the localhost-based TCP connection went through security measures significantly slowing it down, while named pipe ones likely did not.
TL:DR: In my case, Named Pipes were around 5-6 times faster than TCP for small packages (have not tested with bigger ones yet)
http://msdn.microsoft.com/en-us/library/aa178138(v=sql.80).aspx
Let me sum it up for you. If you are worried about performance then use TCP/IP. But if you have a really fast network and your not worried about performance then Named Pipes would be "neat" in that it might save you some code.
Not to mention, if you stick to TCP then you will have something that can be scaled, and even load balanced when the time comes.
Cheers,
In the scenario you describe, the local TCP connections are very unlikely to be a bottleneck. It will introduce some overhead, of course, but this should be negligible unless your CPU is already running hot.
At a guess, if your server's CPU usage is normally below 50% or so (with the proxy in place) it isn't worth worrying about minimizing the overhead associated with the local TCP connections.
If CPU usage is regularly above 80% you should probably be doing some profiling. I'd start by comparing the CPU load (or, better still, the performance, if you can measure it meaningfully) when the proxy is in place to when it isn't. Unless the proxy is doing some complicated processing, the overhead associated with the extra TCP connections is probably a significant fraction of the total overhead introduced by the proxy, so that should give you at least an order-of-magnitude estimate of how much you'd gain by using a more efficient form of IPC.
What is the reason to have a proxy on the SAME machine, just curious?
Anyway:
There are several methods for IPC, TCP/IP, named Pipes are comparable in speed and complexity. If you really want something that scales well and has almost no overhead: use shared memory. Best used in combination with a lock free algorithm for advancing the pointers (or use one buffer for each reader (the proxy/the service) and writer(the service/the proxy)).

Low latency/high performance network (ethernet) messaging

Background
I want to create a test application to test the network performance of different systems. To do this I plan to have that machine send out Ethernet frames over a private (otherwise non-busy) network to another machine (or device) that simply receives the message and sends it back. The sending application will record total roundtrip time (among other things).
The purpose of the tests is to see how a particular system (OS + components etc.) performs when it comes to networking traffic. This is illustrated as machine A in the picture below. Note that I'm not interested in the performance of the networking infrastructure (switches, cables etc) - I'm trying to test the performance of network traffic inside Machine A (i.e from when it hits the network card until it reaches user space)
We will (try to) measure all kind of things, one thing is the total roundtrip of the message but also things like interrupt latency of Machine A, general driver overhead etc. Machine A will be a real-time system. But to support these tests, I need a separate machine that can bounce back messages and in other ways add network stimuli to the tested system. This separate machine is Machine B in the picture below and is what this question is about.
My problem
I want to develop an application that can receive and return these messages with as consistent (and preferably low) latency as possible. I'm hoping to get latencies that are consistent within a few microseconds at least. For simplicity, I'd like to do this on a general purpose OS like Windows or Linux but I'm open for other suggestions. There will be no other load (CPU or otherwise) on the machine besides the operating system and my test application.
I've thought of the following approaches:
A normal application running in user space with a high priority
A thread running in kernel space to avoid the userspace/kernelspace transitions
An of-the-shelf device that already does this (I haven't found one though)
Questions
Are there any other approaches or perhaps frameworks that already does this? What else do I need to think of to gain a consistent and low latency? What approach is recommended?
You mentioned that you want to test the internal performance of Machine A, but "need a separate machine"; yet, you don't want to test network infrastructure performance.
You know much more about your requirements than I do; however, if I was testing network infrastructure in Machine A, I would set up my test like this:
There are couple of reasons for this:
You can use an Ethernet loopback cable to simulate the "pong" function performed by Machine B
Eliminating transit through infrastructure you don't care about is almost always a better solution when measuring performance
If you use this test method, be sure to note these points:
Ethernet performs a signal to noise test on the copper before it sets up a link. If you make your loopback bends too tight, you could introduce more latency if ethernet decides to fall back to a lower speed due to the kinks in the cable. There is no minimum length for copper ethernet cabling.
As you're probably aware, combinations of NICs / driver versions / OS can have a significant affect on intra-host latency. I work for a network equipment manufacturer, and one of the guys in the office used to work as an applications engineer for SolarFlare. He claims that many of the Wall Street trading systems use SolarFlare's NICs due to the low latency SolarFlare engineers their products for; he also said SolarFlare's drivers give you user-space access to the NIC buffers. Caveat: third-hand info, and I cannot verify myself.
If you loop the frames to Machine A, set the source and destination mac-address to the burned-in-address on the NIC
Even if you need to receive a modified "pong" frame from Machine B, you could still use this topology and simply rewrite packet fields on the receive-side of your code in Machine A. Put as many (or few) instrumentation points as you like in Machine A's "modules" to compare frame timestamps.
FYI:
The embedded systems I mentioned in my comments on your question are for measuring latency of network infrastructure, not end hosts. This is the best method I can think of for instrumenting host latency.
As an off the shelf solution, I would suggest taking a look at Solace, Tibco and AMQP. These are all enterprise messaging frameworks used extensively in trading applications. AMQP is open source and capable of handling throughputs of up to 100,000 messages per second. I am not sure of the latencies of other frameworks. There is a Java or C++ implementation of the AMQP message router. The C++ one of course returns higher performance.
Edit I've just heard of a new product called UltraMessaging which can provide 7,000,000 messages per second throughput with Java, C++ or C# clients. Crikey.
Best regards,

Simulating Slow Internet Connection [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
The community reviewed whether to reopen this question 3 days ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I know this is kind of an odd question. Since I usually develop applications based on the "assumption" that all users have a slow internet connection. But, does anybody think that there is a way to programmatically simulate a slow internet connection, so I can "see" how an application performs under various "connection speeds"?
I'm not worried about which language is used. And I'm not looking for code samples or anything, just interested in the logic behind it.
Starting with Chrome 38 you can do this without any plugins. Just click inspect element (or F12 hotkey), then click on "toggle device mod" and you will see something like this:
Among many other features it allows you to simulate specific internet connection (3G, GPRS)
P.S. for people who try to limit the upload speed. Sadly at the current time it is not possible.
P.S.2 now you do not need to toggle anything. Throttling panel is available right from the network panel.
Note that while clicking on the No throttling you can create your custom throttling options.
If you're running windows, fiddler is a great tool. It has a setting to simulate modem speed, and for someone who wants more control has a plugin to add latency to each request.
I prefer using a tool like this to putting latency code in my application as it is a much more realistic simulation, as well as not making me design or code the actual bits. The best code is code I don't have to write.
ADDED: This article at Pavel Donchev's blog on Software Technologies shows how to create custom simulated speeds: Limiting your Internet connection speed with Fiddler.
Google recommends:
Network Link Conditioner on OSX
Clumsy on Windows
Dummynet on Linux
On Linux machines u can use wondershaper
apt-get install wondershaper
$ sudo wondershaper {interface} {down} {up}
the {down} and {up} are bandwidth in kpbs
So for example if you want to limit the bandwidth of interface eth1 to 256kbps uplink and 128kbps downlink,
$ sudo wondershaper eth1 256 128
To clear the limit,
$ sudo wondershaper clear eth1
I was using http://www.netlimiter.com/ and it works very well. Not only limit speed for single processes but also shows actual transfer rates.
There are TCP proxies out there, like iprelay and Sloppy, that do bandwidth shaping to simulate slow connections. You can also do bandwidth shaping and simulate packet loss using IP filtering tools like ipfw and iptables.
You can try Dummynet, it can simulates queue and bandwidth limitations, delays, packet losses, and multipath effects
Use a web debugging proxy with throttling features, like Charles or Fiddler.
You'll find them useful web development in general. The major difference is that Charles is shareware, whereas Fiddler is free.
Also, for simulating a slow connection on some *nixes, you can try using ipfw. More information is provided by Ben Newman's answer on this Quora question
You can use NetEm (Network Emulation) as a proxy server to emulate many network characteristics (speed, delay, packet loss, etc.). It controls the networking using iproute2 package and it's enabled in the kernel of most Linux distributions.
It is controlled by the tc command-line application (from the iproute2 package), but there are also some web interface GUIs for NetEm, for example PHPnetemGUI2.
The advantage is that, as I wrote, it can emulate not only different network speeds but also, for example, the packet loss, duplication and/or corruption, random or defined delay, etc., so apart from the slow connections, you can also emulate various poorly performing networks and transmission errors.
For your application it's absolutely transparent, you can configure the operating system to use the NetEm as a proxy server, so all connections from that machine will be routed through it. Or you can configure only a specific application to use that proxy.
I have been using it to test the performance of an Android app on various emulated poor-performance networks.
Use a tool like TCPMon. It can fake a slow connection.
Basically, you request it the exact same thing and it just forwards the exact same request to the real server, and then delays the response with only the set amount of bytes.
For Linux, the following list of papers might be useful:
A Comparative Study of Network Link Emulators (2009)
KauNet: A Versatile and Flexible Emulation System (2009)
Dummynet Revisited (2010)
Measuring Accuracy and Performance of Network Emulators (2015)
Personally, whilst Dummynet is good, I find NetEm to be the most versatile for my use-cases; I'm usually interested in the effect of delays, rather than bandwidth (i.e. WiFi connection issues), and it's super-easy to emulate random packet loss/corruption, etc. It's also very accessible, and free (unlike the hardware-based Linktropy).
On a side-note, for Windows, Clumsy is awesome. I would also like to add that (regarding websites) browser throttling is not an accurate method for emulating real-life network issues (I think "TKK" commented on a few of the reasons why above).
Hope this helps someone!
One common case of shaping a single TCP connection can actually be assembled from dual pairs of socat and cpipe in UNIX fashion like this:
socat TCP-LISTEN:5555,reuseaddr,reuseport,fork SYSTEM:'cpipe -ngr -b 1 -s 10 | socat - "TCP:localhost:5000" | cpipe -ngr -b 1 -s 300'
This simulates a connection with bandwidth of approximately 300kB/s from your service at :5000 and to at approximately 10kB/s and listens on :5555 for incoming connections. Caveat: Note that this per-connection, so each individual TCP connection gets this amount.
Explanation:
The outer (left) socat listens with the given options on :5555 as a forking server. The first cpipe command in the SYSTEM:... option then throttles data that went into socket :5555 (and comes out of the first, outer socat) to at most 10kByte/s. That data is then forwarding using another socat which connects to localhost:5000 (where the service you want to slow down should be listening). Data from localhost:5000 is then put into the right cpipe command, which (with the given values) throttles it to about 300kB/s.
The option -ngr to cpipe is important. It causes cpipe to read non-greedily from its input file-descriptor. Otherwise, you might get stuck with data in the buffers not being forwarded and waiting for a reply.
Using the more common buffer tool instead of cpipe is likely possible as well.
(Credits: This is based on the "double-tee" recipe by Christophe Loor from the socat documentation)
Mac OSX since 10.10 has an app called Murus Firewall, which acts as a GUI to pf, the replacement for ipfw.
It works very well for system-wide or domain-specific throttling. I was just able to use it to slide my download speed between 300Kbps and 30Mbps to test how a streaming video player adjusts.
Updating this (9 years after it was asked) as the answer I was looking for wasn't mentioned:
Firefox also has presets for throttling connection speeds. Find them in the Network Monitor tab of the developer tools. Default is 'No throttling'.
Slowest is GPRS (Download speed: 50 Kbps, Upload speed: 20 Kbps, Minimum latency (ms): 500), ranging through 'good' and 'regular' 2G, 3G and 4G to DSL and WiFi (Download speed: 30Mbps, Upload speed: 15Mbps, Minimum latency (ms): 2).
More in the Dev Tools docs.
There is also another tool called WIPFW - http://wipfw.sourceforge.net/
It's a bit old school, but you can use it to simulate a slower connection. It's Windows based, and the tool allows the administrator to monitor how much traffic the router is getting from a certain machine, or how much WWW traffic it is forwarding, for example.
There is a simple and practical way to do it, without any application or code. Just connect to the internet using a mobile hotspot. Keep moving the hotspot (phone) away from the connected device to simulate slower networks. 😉

Resources