I have two C-Applications, both running on the same machine on windows XP. Based on the data in this thread: Sockets On Same Machine For Windows and Linux I should see very high speed on this connection.
But I can not transfer more than 500mbit/s. I use 127.0.0.1 as IP-Adress, and also the nodelay option. A single message has about 3.5mbyte and I have to send up to 30 of those message per second.
If there is no possibility I will have to zip those message somehow, but this will create a huge overhead of CPU-Load.
Any idea?
The size of the buffer you are sending can have a big impact on performance. For instance, if you use a small buffer you will be doing a lot of costly writes where only one is necessary.
I recommend you too make writes of 1492 bytes, that's about the size TCP usually handles. You can play around with other values to see if you get better performance.
Related
I just got up and running a service that at its peak needs to handle simultaneous TCP connections in the tens of millions. It's currently running without much tuning by just scaling up to a large number of hosts. The software itself is written in Netty and doesn't do much except translating data frames coming over WebSocket pipes into Kafka events.
My current goal is to be able to pack as many connections on a single machine as possible. I've currently settled for EC2 r6i.2xlarge instances which have 8 CPUs and 64GB of memory and I'm looking for some advice on kernel network stack and Netty tuning.
Some stats on WebSocket traffic patterns:
Each client sends a WebSocket data frame about once per 10 seconds.
Data frames are less than 32KB in size and most are less than 4KB.
We can have sudden bursts of a few million connections in a matter of seconds (various competitions/events).
Many connections are quite short and by far the most common data is actually a TCP accept followed by a login data frame followed by a connection close a few tens of seconds later.
From the above we see that the bitrate per TCP connection is less than 1KB/s and the connections are mostly idle. However, on the backend side we push events to Kafka in batches, so those are a much smaller number of sockets pushing lots of data each.
I've currently increased the ulimit and tcp_max_orphans flags to about 10 million each, since I assumed that both of these would be an issue.
Anyone familiar with the TCP/IP stack internals that would have some advice on what the most important tunables to look into would be?
My own starting point would be to limit the amount of memory that each socket uses as well as increase the amount of memory available to the TCP/IP stack. However, the math here is not very clear from the docs, i.e. how the different flags relate, since I don't know exactly how much memory a single TCP connection consumes inside the kernel.
Some concrete questions:
What options to use in Netty for these frontend WebSocket TCP connections given the traffic patterns described?
How to minimize the amount of memory used per socket in the kernel as well as how to calculate/set kernel memory limits from there?
Anything else worth looking into?
As a test I wrote little tool to test the LAN connection between two PCs.
It is a client/server model that just sends as many UDP packets as it can and on the other side I read everything I can.
To max out my resources, I start a goroutine for every core my machine has.
Sending, receiving and measuring speed works, but when I get to high throughput (500+ Mb/s), the receiving end becomes completely unresponsive.
If I throttle the connection, I don't have any problems.
Also my CPU maxes out just one core (although i used runtime.GOMAXPROCS(0) and start to receive in runtime.NumCPU goroutines)
I uploaded the code to GitHub over here: https://github.com/femot/lanbench
If I change the client to run locally, the problem does not occur. It only happens, if I start the client from another PC (although the measured speed also tops out at 650 Mb/s)
Your server is limited first by the delta channel with a buffer of 100. I'm sure at any significant packet rate that you will be overwhelming that loop.
This isn't a very good benchmark, since your packet rate is going to be a limiting factor more so than bandwidth. You're specifically only trying to test how fast Go can send and receive 1024byte UDP datagrams.
Regardless of how many goroutines you start, the IO is all going through the network poller in a single thread. If you can't saturate your link with a single core, you're going to need multiple process or you need to do this in another language.
I was trying to set up a bandwidth test between two PCs, with only a switch between them. All network hardware is gigabit. One one machine I put a program to open a socket, listen for connections, accept, followed by a loop to read data and measure bytes received against the 'performance counter'. On the other machine, the program opened a socket, connected to the first machine, and proceeds into a tight loop to pump data into the connection as fast as possible, in 1K blocks per send() call. With just that setup, things seem acceptably fast; I could get about 30 to 40 MBytes/sec through the network - distinctly faster than 100BaseT, within the realm of plausibility for gigabit h/w.
Here's where the fun begins: I tried to use setsockopt() to set the size of the buffers (SO_SNDBUF, SO_RCVBUF) on each end to 1K. Suddenly the receiving end reports it's getting a mere 4,000 or 5,000 bytes a second. Instrumenting the transmit side of things, it appears that the send() calls take 0.2 to 0.3 seconds each, just to send 1K blocks. Removing the setsockopt() from the receive side didn't seem to change things.
Now clearly, trying to manipulate the buffer sizes was a Bad Idea. I had thought that maybe forcing the buffer size to 1K, with send() calls of 1K, would be a way to force the OS to put one packet on the wire per send call, with the understanding that this would prevent the network stack from efficiently combining the data for transmission - but I didn't expect throughput to drop to a measly 4-5K/sec!
I don't have time on the resources to chase this down and really understand it the way I'd like to, but would really like to know what could make a send() take 0.2 seconds. Even if it's waiting for acks from the other side, 0.2 seconds is just unbelievable. What gives?
Nagle?
Windows networks with small messages
The explanation is simply that a 1k buffer is an incredibly small buffer size, and your sending machine is probably sending one packet at a time. The sender must wait for the acknowledgement from the receiver before emptying the buffer and accepting the next block to send from your application (because the TCP layer may need to retransmit data later).
A more interesting exercise would be to vary the buffer size from its default for your system (query it to find out what that is) all the way down to 1k and see how each buffer size affects your throughput.
my (network) client sends 50 to 100 KB data packets every 200ms to my server. there're up to 300 clients. Server sends nothing to client. Server (dedicated) and clients are in LAN. How can I tune TCP configuration for better performance? Server on Windows Server 2003 or 2008, clients on Windows 2000 and up.
e.g. TCP window size. Does changing this parameter help? anything else? any special socket options?
[EDIT]: actually in different modes packets can be up to 5MB
I did a study on this a couple of years ago wth 1700 data points. The conclusion was that the single best thing you can do is configure an enormous socket receive buffer (e.g. 512k) at the receiver. Do that to the listening socket, so it will be inherited by the accepted sockets, so it will already be set while they are handshaking. That in turn allows TCP window scaling to be negotiated during the handshake, which allows the client to know about the window size > 64k. The enormous window size basically lets the client transmit at the maximum possible rate, subject only to congestion avoidance rather than closed receive windows.
What OS?
IPv4 or v6?
Why so large of a dump ; why can't it be broken down?
Assuming a solid, stable, low bandwidth:delay prod, you can adjust things like inflight sizing, initial window size, mtu (depending on the data, IP version, and mode[tcp/udp].
You could also round robin or balance inputs, so you have less interrupt time from the nic .. binding is an option as well..
5MB /packet/? That's a pretty poor design .. I would think it'd lead to a lot of segment retrans's , and a LOT of kernel/stack mem being used in sequence reconstruction / retransmits (accept wait time, etc)..
(Is that even possible?)
Since all clients are in LAN, you might try enabling "jumbo frames" (need to run a netsh command for that, would need to google for the precise command, but there are plenty of how-tos).
On the application layer, you could use TransmitFile, which is the Windows sendfile equivalent and which works very well under Windows Server 2003 (it is artificially rate-limited under "non server", but that won't be a problem for you). Note that you can use a memory mapped file if you generate the data on the fly.
As for tuning parameters, increasing the send buffer will likely not give you any benefit, though increasing the receive buffer may help in some cases because it reduces the likelihood of packets being dropped if the receiving application does not handle the incoming data fast enough. A bigger TCP window size (registry setting) may help, as this allows the sender to send out more data before having to block until ACKs arrive.
Yanking up the program's working set quota may be worth a consideration, it costs you nothing and may be an advantage, since the kernel needs to lock pages when sending them. Being allowed to have more pages locked might make things faster (or might not, but it won't hurt either, the defaults are ridiculously low anyway).
We are using a business Ethernet connection (3Mbit upload, 3Mbit download) and trying to understand issues with our tested bandwidth speeds. When uploading a large file we sustain 340 KB/s; downloading we sustain 340KB/s. However when we run these transfers simultaneously the two transfer speeds rise and fall erratically with a average speed for both at around 250 KB/s. We're using a Hatteras HN404 CPi and we've bypassed the router (plugged a machine directly into the Hatteras; set the NIC to full-duplex).
Is this expected? Should a max upload interfere with a max download on this type of Internet connection?
Are you sure the bottleneck is your connection?
Do you also see this behavior when the simultaneous upload and download are occurring on different systems, or only when one system is handling both the upload and download?
If the problem goes away when independent machines are doing the work, the bottleneck is likely closer to the hard drive.
This sounds expected from my experience with lower end lines. On a home line, I've found that traffic shaping and changing buffer sizes can be a huge help.
TCP/IP without any unusual traffic shaping will favor the most aggressive traffic at the expense of everything else. In your case, this means responses to the outgoing ACKs and such for the download will be delayed or maybe even dropped. See if your HN404 supports class based queuing or something similar and try it out.
Yes it is expected. This is symptomatic of any case in which you have a throttled or capped connection. If you saturate your uplink it will affect your downlink and vice versa.
This is because the your connection's rate-limiting impacts the TCP handshake acknowledgement packets (ACKs) and disrupts the normal "balance" of how these packets flow.
This is very thoroughly described on this page about Cable Modem Troubleshooting Tips, although it is not limited to cable modems:
If you saturate your cable modem's
upload cap with an upload, the ACK
packets of your download will have to
queue up waiting for a gap between the
congested upload data packets. So your
ACKs will be delayed getting back to
the remote download server, and it
will therefore believe you are on a
very slow link, and slow down the
transmission of further data to you.
So how do you avoid this? The best way is to implement some sort of traffic-shaping or QoS (Quality of Service) on individual sessions to limit them to a maximum throughput based on a percentage of your total available bandwidth.
For example on my home network I have it so that no outbound connection can utilize any more than 67% (2/3rd) of my 192Kbps uplink. That means any single outbound session can only utilized 128Kbps, therefore protecting my downlink speed by preventing the uplink from becoming saturated.
In most cases you are able to perform this kind of traffic-shaping based on any available criteria such as source ip, destination ip, protocol, port, time of day, etc.
It appears that I was wrong about the simultaneous transfer speeds. The 250KB/s speeds up and down were miscalculated by the transfer program (seemed to have been showing a high average speed). Apparently the Business Ethernet (in this case it is an XO circuit provisioned by Speakeasy) only supports 3Mb total, not up AND down (for 6Mbit total). So if I am transferring up and down at the same time in theory I should only have 1.5Mbit up and down or 187.5KB/s at the maximum (if there was zero overhead).