How to explain this incredibly slow socket connection? - windows

I was trying to set up a bandwidth test between two PCs, with only a switch between them. All network hardware is gigabit. One one machine I put a program to open a socket, listen for connections, accept, followed by a loop to read data and measure bytes received against the 'performance counter'. On the other machine, the program opened a socket, connected to the first machine, and proceeds into a tight loop to pump data into the connection as fast as possible, in 1K blocks per send() call. With just that setup, things seem acceptably fast; I could get about 30 to 40 MBytes/sec through the network - distinctly faster than 100BaseT, within the realm of plausibility for gigabit h/w.
Here's where the fun begins: I tried to use setsockopt() to set the size of the buffers (SO_SNDBUF, SO_RCVBUF) on each end to 1K. Suddenly the receiving end reports it's getting a mere 4,000 or 5,000 bytes a second. Instrumenting the transmit side of things, it appears that the send() calls take 0.2 to 0.3 seconds each, just to send 1K blocks. Removing the setsockopt() from the receive side didn't seem to change things.
Now clearly, trying to manipulate the buffer sizes was a Bad Idea. I had thought that maybe forcing the buffer size to 1K, with send() calls of 1K, would be a way to force the OS to put one packet on the wire per send call, with the understanding that this would prevent the network stack from efficiently combining the data for transmission - but I didn't expect throughput to drop to a measly 4-5K/sec!
I don't have time on the resources to chase this down and really understand it the way I'd like to, but would really like to know what could make a send() take 0.2 seconds. Even if it's waiting for acks from the other side, 0.2 seconds is just unbelievable. What gives?

Nagle?
Windows networks with small messages

The explanation is simply that a 1k buffer is an incredibly small buffer size, and your sending machine is probably sending one packet at a time. The sender must wait for the acknowledgement from the receiver before emptying the buffer and accepting the next block to send from your application (because the TCP layer may need to retransmit data later).
A more interesting exercise would be to vary the buffer size from its default for your system (query it to find out what that is) all the way down to 1k and see how each buffer size affects your throughput.

Related

Measuring client back-up when using Boost.Beast WebSocket

I am reading from a Boost.Beast WebSocket. When my application gets backed up, the websocket sender appears happy to delay/buffer the data on their end (presumably at the application level, as they will delay by 1 minute or more).
What is the best way to measure if I am getting backed up? For example, can I look at the size of a TCP buffer? I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread (in which case, backup can be measured by the size of the queue). But I'm wondering if there's a more direct way.
This varies by platform, but there's the SO_RCVBUF option that sets the amount of data that can be queued onto the socket before TCP pauses receiving more data.
If you have access to the socket, s, invoke this to inspect how much data its rcv buffer size can hold
net::socket_base::receive_buffer_size opt = {};
s.get_option(opt);
You'll probably see that it defaults to something like 64K or so.
Then crank it up real high to like a megabyte:
net::socket_base::receive_buffer_size optSet(1000000);
boost::system::error_code ec;
s.set_option(optSet, ec);
YMMV on how large of a value you can pass to the set_option call and how much actually helps.
Keep in mind, this is only a temporary measure to relieve the pressure. If you keep getting backed up, you'll only hit the limit again, just a bit later and perhaps less often.
I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread
Yes, but you've basically implemented exactly what SO_RCVFROM does. Either that, or you buffer to infinity with respect to memory cost (no limit).

Reading from net.UDPConn locks up PC

As a test I wrote little tool to test the LAN connection between two PCs.
It is a client/server model that just sends as many UDP packets as it can and on the other side I read everything I can.
To max out my resources, I start a goroutine for every core my machine has.
Sending, receiving and measuring speed works, but when I get to high throughput (500+ Mb/s), the receiving end becomes completely unresponsive.
If I throttle the connection, I don't have any problems.
Also my CPU maxes out just one core (although i used runtime.GOMAXPROCS(0) and start to receive in runtime.NumCPU goroutines)
I uploaded the code to GitHub over here: https://github.com/femot/lanbench
If I change the client to run locally, the problem does not occur. It only happens, if I start the client from another PC (although the measured speed also tops out at 650 Mb/s)
Your server is limited first by the delta channel with a buffer of 100. I'm sure at any significant packet rate that you will be overwhelming that loop.
This isn't a very good benchmark, since your packet rate is going to be a limiting factor more so than bandwidth. You're specifically only trying to test how fast Go can send and receive 1024byte UDP datagrams.
Regardless of how many goroutines you start, the IO is all going through the network poller in a single thread. If you can't saturate your link with a single core, you're going to need multiple process or you need to do this in another language.

TCP/IP on localhost slow

I have two C-Applications, both running on the same machine on windows XP. Based on the data in this thread: Sockets On Same Machine For Windows and Linux I should see very high speed on this connection.
But I can not transfer more than 500mbit/s. I use 127.0.0.1 as IP-Adress, and also the nodelay option. A single message has about 3.5mbyte and I have to send up to 30 of those message per second.
If there is no possibility I will have to zip those message somehow, but this will create a huge overhead of CPU-Load.
Any idea?
The size of the buffer you are sending can have a big impact on performance. For instance, if you use a small buffer you will be doing a lot of costly writes where only one is necessary.
I recommend you too make writes of 1492 bytes, that's about the size TCP usually handles. You can play around with other values to see if you get better performance.

how TCP can be tuned for high-performance one-way transmission?

my (network) client sends 50 to 100 KB data packets every 200ms to my server. there're up to 300 clients. Server sends nothing to client. Server (dedicated) and clients are in LAN. How can I tune TCP configuration for better performance? Server on Windows Server 2003 or 2008, clients on Windows 2000 and up.
e.g. TCP window size. Does changing this parameter help? anything else? any special socket options?
[EDIT]: actually in different modes packets can be up to 5MB
I did a study on this a couple of years ago wth 1700 data points. The conclusion was that the single best thing you can do is configure an enormous socket receive buffer (e.g. 512k) at the receiver. Do that to the listening socket, so it will be inherited by the accepted sockets, so it will already be set while they are handshaking. That in turn allows TCP window scaling to be negotiated during the handshake, which allows the client to know about the window size > 64k. The enormous window size basically lets the client transmit at the maximum possible rate, subject only to congestion avoidance rather than closed receive windows.
What OS?
IPv4 or v6?
Why so large of a dump ; why can't it be broken down?
Assuming a solid, stable, low bandwidth:delay prod, you can adjust things like inflight sizing, initial window size, mtu (depending on the data, IP version, and mode[tcp/udp].
You could also round robin or balance inputs, so you have less interrupt time from the nic .. binding is an option as well..
5MB /packet/? That's a pretty poor design .. I would think it'd lead to a lot of segment retrans's , and a LOT of kernel/stack mem being used in sequence reconstruction / retransmits (accept wait time, etc)..
(Is that even possible?)
Since all clients are in LAN, you might try enabling "jumbo frames" (need to run a netsh command for that, would need to google for the precise command, but there are plenty of how-tos).
On the application layer, you could use TransmitFile, which is the Windows sendfile equivalent and which works very well under Windows Server 2003 (it is artificially rate-limited under "non server", but that won't be a problem for you). Note that you can use a memory mapped file if you generate the data on the fly.
As for tuning parameters, increasing the send buffer will likely not give you any benefit, though increasing the receive buffer may help in some cases because it reduces the likelihood of packets being dropped if the receiving application does not handle the incoming data fast enough. A bigger TCP window size (registry setting) may help, as this allows the sender to send out more data before having to block until ACKs arrive.
Yanking up the program's working set quota may be worth a consideration, it costs you nothing and may be an advantage, since the kernel needs to lock pages when sending them. Being allowed to have more pages locked might make things faster (or might not, but it won't hurt either, the defaults are ridiculously low anyway).

How do you rate-limit an IO operation?

Suppose you have a program which reads from a socket. How do you keep the download rate below a certain given threshold?
At the application layer (using a Berkeley socket style API) you just watch the clock, and read or write data at the rate you want to limit at.
If you only read 10kbps on average, but the source is sending more than that, then eventually all the buffers between it and you will fill up. TCP/IP allows for this, and the protocol will arrange for the sender to slow down (at the application layer, probably all you need to know is that at the other end, blocking write calls will block, nonblocking writes will fail, and asynchronous writes won't complete, until you've read enough data to allow it).
At the application layer you can only be approximate - you can't guarantee hard limits such as "no more than 10 kb will pass a given point in the network in any one second". But if you keep track of what you've received, you can get the average right in the long run.
Assuming a network transport, a TCP/IP based one, Packets are sent in response to ACK/NACK packets going the other way.
By limiting the rate of packets acknowledging receipt of the incoming packets, you will in turn reduce the rate at which new packets are sent.
It can be a bit imprecise, so its possibly optimal to monitor the downstream rate and adjust the response rate adaptively untill it falls inside a comfortable threshold. ( This will happen really quick however, you send dosens of acks a second )
It is like when limiting a game to a certain number of FPS.
extern int FPS;
....
timePerFrameinMS = 1000/FPS;
while(1) {
time = getMilliseconds();
DrawScene();
time = getMilliseconds()-time;
if (time < timePerFrameinMS) {
sleep(timePerFrameinMS - time);
}
}
This way you make sure that the game refresh rate will be at most FPS.
In the same manner DrawScene can be the function used to pump bytes into the socket stream.
If you're reading from a socket, you have no control over the bandwidth used - you're reading the operating system's buffer of that socket, and nothing you say will make the person writing to the socket write less data (unless, of course, you've worked out a protocol for that).
All that reading slowly would do is fill up the buffer, and cause an eventual stall on the network end - but you have no control of how or when this happens.
If you really want to read only so much data at a time, you can do something like this:
ReadFixedRate() {
while(Data_Exists()) {
t = GetTime();
ReadBlock();
while(t + delay > GetTime()) {
Delay()'
}
}
}
wget seems to manage it with the --limit-rate option. Here's from the man page:
Note that Wget implements the limiting
by sleeping the appropriate amount of
time after a network read that took
less time than specified by the
rate. Eventually this strategy causes
the TCP transfer to slow down to
approximately the specified rate.
However, it may take some time for
this balance to be achieved, so don't
be surprised if limiting the rate
doesn't work well with very small
files.
As other have said, the OS kernel is managing the traffic and you are simply reading a copy of the data out of kernel memory. To roughly limit the rate of just one application, you need to delay your reads of the data and allow incoming packets to buffer up in the kernel, which will eventually slow the acknowledgment of incoming packets and reduce the rate on that one socket.
If you want to slow all traffic to the machine, you need to go and adjust the sizes of your incoming TCP buffers. In Linux, you would affect this change by altering the values in /proc/sys/net/ipv4/tcp_rmem (read memory buffer sizes) and other tcp_* files.
To add to Branan's answer:
If you voluntarily limit the read speed at the receiver end, eventually queues will fill up at both end. Then the sender will either block in its send() call or return from the send() call with a sent_length less than the expected length passed on to the send() call.
If the sender is not ready to deal with this case by sleeping and trying to resend what has not fit into OS buffers, you will ending up have connection issues (the sender may detect this as an error) or losing data (the sender may unknowingly discard data the did not fit into OS buffers).
Set small socket send and receive buffers, say 1k or 2k, such that the bandwidth*delay product = the buffer size. You may not be able to get it small enough over fast links.

Resources