From the iperf man page:
iperf is a tool for performing network throughput measurements. It can test either TCP or UDP throughput. To perform an iperf test the user must establish both a server (to discard traffic) and a client (to generate traffic).
Basically, you run an iperf server at one end and an iperf client at the other end.
My question is:
Suppose there are two machines, A and B. If you run the iperf server on machine A and the client on machine B, you get a number X. If you run the iperf server on machine B and the client on machine A, you get a number Y.
X and Y denote throughput. Does X denote the throughput of a specific machine (A or B)?
If you say it is not machine-specific and only denotes the throughput of the link between them, why should I observe different throughput when I swap the client and server? (I actually have observed this.)
Thanks in advance.
When you swap the machines, you're swapping the operations, which are slightly different. In the case of TCP, sending is directly affected by the buffer sizes in the TCP/IP stack: since iperf is an application handing data down to the stack, what it measures on the sending side is how fast it was able to dump data into the TCP/IP stack, and the smaller the buffers, the slower that operation gets. Receiving is different: most likely the application will consume packets as they arrive, so the receive buffers will be empty most of the time.
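To make the sending-side effect concrete, here is a minimal sketch (illustrative Python, not iperf itself; host, port, and buffer sizes are placeholders) that times how fast an application can hand data to the kernel:

    import socket, time

    def send_rate(host: str, port: int, sndbuf: int, total: int = 64 * 2**20) -> float:
        """Time how quickly we can push `total` bytes into the kernel send path."""
        s = socket.create_connection((host, port))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
        chunk = b"\x00" * 65536
        sent, start = 0, time.perf_counter()
        while sent < total:
            sent += s.send(chunk)  # blocks only when the send buffer is full
        return sent / (time.perf_counter() - start)  # bytes/s as seen by the app

With a small SO_SNDBUF, send() blocks often and the measured rate tracks the wire; with a large one, the first chunks are absorbed instantly by the kernel buffer, so the application-side number can briefly exceed what the link actually carries.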
The compute board has two 100Gbps ports: one connects to a client machine, the other to a server machine running an HTTP server. The client machine generates HTTP requests to the HTTP server, and all traffic goes through the compute board.
Given this setup, I wonder what the upper bound of the board's L2 throughput is. Let's assume the board's internal bus and processing are more than enough to handle 200Gbps of throughput, and that the client and server machines are both powerful enough to saturate the board.
I think the upper bound is 100Gbps, but somehow I am not so sure. Is it possible for a 100G port to handle 100G ingress and 100G egress at the same time? If so, that means the upper bound is 200Gbps; is that right?
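For what it's worth, here is the arithmetic under the assumption that both ports are full duplex (which 100GbE is by specification), sketched in Python:

    # Back-of-the-envelope sketch, assuming full-duplex ports:
    port_rate_gbps = 100
    # Client -> server requests: ingress on port 1, egress on port 2.
    # Server -> client responses: ingress on port 2, egress on port 1.
    # Each port can carry 100G in and 100G out simultaneously, so the
    # board can forward 100G in each direction at once:
    aggregate_gbps = 2 * port_rate_gbps  # 200 Gbps total forwarded
    # ...but any single direction is still capped at 100 Gbps.
    print(aggregate_gbps)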
I just got a service up and running that at its peak needs to handle simultaneous TCP connections in the tens of millions. It's currently running without much tuning, simply by scaling up to a large number of hosts. The software itself is written in Netty and doesn't do much except translate data frames coming over WebSocket pipes into Kafka events.
My current goal is to pack as many connections onto a single machine as possible. I've currently settled on EC2 r6i.2xlarge instances, which have 8 CPUs and 64GB of memory, and I'm looking for advice on kernel network stack and Netty tuning.
Some stats on WebSocket traffic patterns:
Each client sends a WebSocket data frame about once per 10 seconds.
Data frames are less than 32KB in size and most are less than 4KB.
We can have sudden bursts of a few million connections in a matter of seconds (various competitions/events).
Many connections are quite short and by far the most common data is actually a TCP accept followed by a login data frame followed by a connection close a few tens of seconds later.
From the above we see that the data rate per TCP connection is less than 1KB/s and the connections are mostly idle. However, on the backend side we push events to Kafka in batches, so there the traffic is a much smaller number of sockets, each pushing lots of data.
I've currently increased the ulimit and tcp_max_orphans flags to about 10 million each, since I assumed that both of these would be an issue.
Is anyone familiar with the TCP/IP stack internals and able to advise on the most important tunables to look into?
My own starting point would be to limit the amount of memory that each socket uses as well as increase the amount of memory available to the TCP/IP stack. However, the math here is not very clear from the docs, i.e. how the different flags relate, since I don't know exactly how much memory a single TCP connection consumes inside the kernel.
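For reference, here is a back-of-the-envelope sketch of how the byte-valued per-socket limits relate to the page-valued global limit. The tcp_rmem/tcp_wmem minimums below are the usual Linux defaults; actual per-socket overhead (struct sock and friends) is extra and not counted here:

    # Rough kernel-memory budget for N mostly idle TCP connections.
    connections = 10_000_000
    tcp_rmem_min = 4096   # bytes; first field of net.ipv4.tcp_rmem
    tcp_wmem_min = 4096   # bytes; first field of net.ipv4.tcp_wmem
    page_size = 4096      # net.ipv4.tcp_mem is measured in PAGES, not bytes

    worst_case = connections * (tcp_rmem_min + tcp_wmem_min)
    print(f"buffer floor: {worst_case / 2**30:.1f} GiB")       # ~76 GiB
    print(f"tcp_mem pages needed: {worst_case // page_size}")  # ~20M pages
    # Idle sockets don't actually pin their buffer minimums, but this is
    # the scale at which tcp_mem's three thresholds would need to be set.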
Some concrete questions:
What options should I use in Netty for these frontend WebSocket TCP connections, given the traffic patterns described?
How do I minimize the amount of memory used per socket in the kernel, and how do I calculate/set the kernel memory limits from there?
Anything else worth looking into?
Given that UDP doesn't send ACKs, how does a program like iperf measure one-way performance, i.e., how can it confirm that the packets actually arrived:
within a time frame
intact, and uncorrupted
By contrast, it intuitively seems to me that TCP, which has an ACK sent back for each segment, allows rigorous benchmarking of movement across a network to be done very reliably from the client.
1/ "how can it confirm that the packets actually reached [...] intact, and uncorrupted"
UDP is an unfairly despised protocol, but come on, this is going way too far here! :-)
UDP has a checksum, just like TCP:
https://en.wikipedia.org/wiki/User_Datagram_Protocol#Checksum_computation
2/ "how can it confirm that the packets actually reached [...] within a time frame"
It does not, because this is not what UDP is about (nor TCP, by the way).[*]
As can be seen from its source code here:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L55
...what it does, though, is check for out-of-order packets. A "pcount" is set on the sending side and checked on the receiving side here:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L99
...and it also calculates a somewhat bogus jitter:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L110
(real life is more complicated than this: you not only have jitter, but also clock drift)
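For the curious, the smoothing applied in the linked code is the classic RFC 3550 (RTP) jitter estimator. A sketch in Python, with illustrative names:

    def update_jitter(jitter: float, prev_transit: float, transit: float):
        # transit = receive_time - sender_timestamp for one packet
        d = abs(transit - prev_transit)   # change in transit time
        jitter += (d - jitter) / 16.0     # exponentially smoothed estimate
        return jitter, transit            # carry transit forward as prev

    # Any constant clock offset between sender and receiver cancels in the
    # difference of transits, which is why jitter (unlike one-way delay)
    # can be reported without synchronized clocks. Drift does not cancel.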
[*]:
For semi-guaranteed, soft "within a time frame" / real-time behavior at layer 3 and above, look at protocols such as RTP and RTSP. But neither TCP nor UDP inherently has this.
For real, serious, hard real-time guarantees, you've got to go down to layer 2 protocols such as Ethernet AVB:
https://en.wikipedia.org/wiki/Audio_Video_Bridging
...which were designed because IP and above simply cannot. make. hard. real. time. guaranteed. delivery. Period.
EDIT:
This is another debate, but...
The first thing you need for "within a time frame" is a shared wall clock on the sending and receiving systems (otherwise, how could you tell that a received packet is out of date?).
From layer 3 (IP) and above, the NTP precision target is about 1ms. It can be better than that on a LAN (but across IP networks, you're just taking a chance and hoping for the best).
At layer 2, aka the "LAN", the Precision Time Protocol (PTP, IEEE 1588) targets the sub-microsecond range. That's about 1000 times more accurate. The same goes for the derived IEEE 802.1AS, "Timing and Synchronization for Time-Sensitive Applications" (gPTP), used in Ethernet AVB.
Conclusion on this sub-topic:
TCP/IP, though very handy and powerful, is not designed to "guarantee delivery within a time frame", be it TCP or UDP. Get this idea out of your head.
The obvious way would be to connect to a server that participates in the testing.
The client starts by (for example) connecting to an NTP server to get an accurate time base.
Then the UDP client sends a series of packets to the server. In its payload, each packet contains:
a serial number
a timestamp when it was sent
a CRC
The server then looks these over and notes whether any serial numbers are missing (after some reasonable timeout), and compares the time it received each packet to the time the client sent it. After some period of time, the server sends a reply indicating how many packets it received, the mean and standard deviation of the transmission times, and how many appeared to be corrupted based on the CRCs they contained.
Depending on taste, you might also want to set up a simultaneous TCP connection from the client to the server to coordinate testing of the UDP channel and (possibly) return results.
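A minimal sketch of such a probe packet in Python (the struct layout and the use of zlib.crc32 are illustrative choices, not a standard):

    import struct, time, zlib

    def make_probe(seq: int) -> bytes:
        body = struct.pack("!Qd", seq, time.time())    # serial + send time
        return body + struct.pack("!I", zlib.crc32(body))

    def parse_probe(data: bytes):
        body, (crc,) = data[:-4], struct.unpack("!I", data[-4:])
        if zlib.crc32(body) != crc:
            return None                                # corrupted in transit
        seq, sent = struct.unpack("!Qd", body)
        return seq, sent, time.time() - sent           # one-way delay

    # The one-way delay is only meaningful if both hosts share a clock
    # (hence the NTP step above); missing seq values indicate loss.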
I've made a simple experiment on TCP transport performance. The setup is as follows:
There are two machines, A and B, each running Ubuntu 12.04 Server. I've installed iperf on both machines and use it to test the transfer rate. A and B are connected through a 100Mbps link. The experiment goes like this:
I use iperf to send from A to B in TCP mode. The result is that on both sides the rate reported by iperf is 100Mbps, and it is very stable.
I use another iperf process to send from B to A, using the same settings. The result is that on both sides the reported rate is a little lower, 99Mbps, stably. But this is understandable.
I use one more iperf process to again send from A to B, in the presence of the previous two traffic flows. Now the weird thing happens: the rates of the three traffic flows are all 50Mbps, on both sides, and all very stable.
I understand that flow 1 and flow 3 share the one-direction link and each gets a bandwidth of 50Mbps. But why is the backward flow, flow 2, also affected and also at 50Mbps? Shouldn't the bidirectional link be regarded as two different links that have no interference with each other?
We are using a business Ethernet connection (3Mbit upload, 3Mbit download) and trying to understand issues with our tested bandwidth speeds. When uploading a large file we sustain 340KB/s; downloading we sustain 340KB/s. However, when we run these transfers simultaneously, the two transfer speeds rise and fall erratically, with an average speed for both of around 250KB/s. We're using a Hatteras HN404 CPi, and we've bypassed the router (plugged a machine directly into the Hatteras and set the NIC to full duplex).
Is this expected? Should a max upload interfere with a max download on this type of Internet connection?
Are you sure the bottleneck is your connection?
Do you also see this behavior when the simultaneous upload and download are occurring on different systems, or only when one system is handling both the upload and download?
If the problem goes away when independent machines are doing the work, the bottleneck is likely closer to the hard drive.
This sounds expected, from my experience with lower-end lines. On a home line, I've found that traffic shaping and changing buffer sizes can be a huge help.
TCP/IP without any unusual traffic shaping will favor the most aggressive traffic at the expense of everything else. In your case, this means the ACKs going back out for the download will be delayed or maybe even dropped. See if your HN404 supports class-based queueing or something similar, and try it out.
Yes, it is expected. This is symptomatic of any case in which you have a throttled or capped connection. If you saturate your uplink it will affect your downlink, and vice versa.
This is because your connection's rate limiting impacts the TCP acknowledgement packets (ACKs) and disrupts the normal "balance" of how these packets flow.
This is very thoroughly described on this page about Cable Modem Troubleshooting Tips, although it is not limited to cable modems:
"If you saturate your cable modem's upload cap with an upload, the ACK packets of your download will have to queue up waiting for a gap between the congested upload data packets. So your ACKs will be delayed getting back to the remote download server, and it will therefore believe you are on a very slow link, and slow down the transmission of further data to you."
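To put numbers on it, a back-of-the-envelope sketch (assuming 1500-byte segments and one delayed ACK per two segments):

    download_bps = 3_000_000                      # 3 Mbit/s downlink
    segments_per_s = download_bps / (1500 * 8)    # ~250 packets/s
    acks_per_s = segments_per_s / 2               # ~125 ACKs/s
    ack_bits = 40 * 8                             # 40-byte TCP/IP ACK
    print(acks_per_s * ack_bits)                  # ~40,000 bit/s upstream

    # The download needs only ~40 kbit/s of upstream for its ACKs, but if
    # the uplink queue is full of bulk upload packets, every ACK waits in
    # that queue, the sender sees a huge RTT, and the download stalls.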
So how do you avoid this? The best way is to implement some sort of traffic-shaping or QoS (Quality of Service) on individual sessions to limit them to a maximum throughput based on a percentage of your total available bandwidth.
For example, on my home network I have it set up so that no outbound connection can utilize more than 67% (2/3) of my 192Kbps uplink. That means any single outbound session can only utilize 128Kbps, which protects my downlink speed by preventing the uplink from becoming saturated.
In most cases you are able to perform this kind of traffic shaping based on any available criteria, such as source IP, destination IP, protocol, port, time of day, etc.
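The shaping itself is usually done in the router or modem, but as an illustration of the idea, here is a minimal token-bucket sketch in Python (rates chosen to match the 67%-of-192Kbps example above):

    import time

    class TokenBucket:
        def __init__(self, rate_bps: float, burst_bytes: float):
            self.rate = rate_bps / 8.0     # refill rate, bytes per second
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def allow(self, nbytes: int) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True
            return False    # over the cap: queue or delay this packet

    session_cap = TokenBucket(rate_bps=128_000, burst_bytes=4096)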
It appears that I was wrong about the simultaneous transfer speeds. The 250KB/s speeds up and down were miscalculated by the transfer program (it seems to have been showing an inflated average speed). Apparently the Business Ethernet (in this case an XO circuit provisioned by Speakeasy) only supports 3Mbit total, not up AND down (i.e. not 6Mbit total). So if I am transferring up and down at the same time, in theory I should only get 1.5Mbit each way, or 187.5KB/s at most (if there were zero overhead).