Reading from net.UDPConn locks up PC - go

As a test, I wrote a little tool to measure the LAN connection between two PCs.
It is a client/server model that just sends as many UDP packets as it can and on the other side I read everything I can.
To max out my resources, I start a goroutine for every core my machine has.
Sending, receiving and measuring speed works, but when I get to high throughput (500+ Mb/s), the receiving end becomes completely unresponsive.
If I throttle the connection, I don't have any problems.
Also, my CPU maxes out just one core (although I used runtime.GOMAXPROCS(0) and start receiving in runtime.NumCPU goroutines).
I uploaded the code to GitHub over here: https://github.com/femot/lanbench
If I change the client to run locally, the problem does not occur. It only happens if I start the client from another PC (although the measured speed also tops out at 650 Mb/s).

Your server is limited first by the delta channel with a buffer of 100. I'm sure at any significant packet rate that you will be overwhelming that loop.
This isn't a very good benchmark, since your packet rate is going to be a limiting factor more so than bandwidth. You're specifically only testing how fast Go can send and receive 1024-byte UDP datagrams.
Regardless of how many goroutines you start, the IO is all going through the network poller in a single thread. If you can't saturate your link with a single core, you're going to need multiple processes, or you need to do this in another language.
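A minimal sketch of one way to take the per-packet channel send out of the receive path: each receiver goroutine adds to a shared atomic counter, and a reporter reads and resets it once per interval. This is not the code from the linked repository. The names byteCounter, add, takeDelta and simulateReceivers are made up for illustration, and the receivers are simulated rather than reading from a real net.UDPConn.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// byteCounter is a hypothetical replacement for a per-packet "delta" channel.
type byteCounter struct {
	n uint64
}

// add records one received datagram of the given size.
func (c *byteCounter) add(size int) {
	atomic.AddUint64(&c.n, uint64(size))
}

// takeDelta returns the bytes counted since the previous call and resets the counter.
func (c *byteCounter) takeDelta() uint64 {
	return atomic.SwapUint64(&c.n, 0)
}

// simulateReceivers stands in for goroutines looping on conn.Read:
// each worker "receives" the given number of datagrams of size bytes.
func simulateReceivers(c *byteCounter, workers, packets, size int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < packets; i++ {
				c.add(size)
			}
		}()
	}
	wg.Wait()
}

func main() {
	var c byteCounter
	simulateReceivers(&c, 4, 1000, 1024)
	fmt.Println("bytes since last report:", c.takeDelta())
}
```

In a real receiver, the reporter would call takeDelta from a time.Ticker loop, so the cost per packet is one atomic add instead of a channel operation.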

Related

Why does using multiple ethernet connections slow down throughput of I/O bound task

So I've got an interesting problem that seems counterintuitive to me. I am building a tool where the biggest bottleneck is the rate at which I can send packets. Currently I can handle over a million requests in less than 30 seconds which is great but I'm trying to squeeze out as much speed as possible. My idea was to attach a second ethernet adapter to the machine and spin up two different net.Dialer's like so
net.Dialer{
    Timeout:   time.Duration(*timeoutPtr) * time.Second,
    LocalAddr: addr,
}
where addr is one of the two ethernet adapters. Then I assign the dialers to a job round robin style like so:
for i, target := range targets {
    dialer = dialers[i%len(dialers)]
    ....
    go someNetworkFunction(dialer)
}
What's surprising to me is that when I run it with 2 adapters it executes much much slower, 30 seconds vs 2 minutes! I'm just trying to understand why giving the code two connections to send packets slows down the code instead of speeding it up. It doesn't appear that the modulus operation there should cause a 300% slowdown. Is there something happening at the kernel layer when trying to use both adapters to send at the same time? Any help would be appreciated.
There can be multiple factors in play:
Kernel or benchmark process in general.
If you profile your app, which at a million requests in 30 seconds is not spending much time in application code, you will probably see that syscall accounts for most of the time. The syscall entry represents CPU time spent outside the view of your application.
If this syscall time increases non-linearly compared to the process you are benchmarking, you have a bottleneck outside of your program.
Goroutines
Goroutines are scheduled onto CPU cores (at the physical level). While they are easy to create, switching between goroutines is not free of overhead. The implementation of someNetworkFunction can make a difference in throughput, since it may block on shared resources or simply switch too often. You can try to manage this by telling the Go program to use fewer threads with GOMAXPROCS. By tweaking this value, you can determine the optimal setting for your program and hardware.
A more in depth explanation of the scheduler can be found at https://rakyll.org/scheduler/

Understanding SPDY latency claims

Reading the SPDY whitepaper at http://dev.chromium.org/spdy/spdy-whitepaper, it seems like supporting it will improve my HTTP latency. However, I'm not clear on a few of the claims.
1) "Because HTTP can only fetch one resource at a time (HTTP pipelining helps, but still enforces only a FIFO queue), a server delay of 500 ms prevents reuse of the TCP channel for additional requests." -- Where did this 500ms number come from?
2) "We discovered that SPDY's latency savings increased proportionally with increases in packet loss rates, up to a 48% speedup at 2%." -- But doesn't putting all the requests on a single TCP connection mean that congestion control will slow down all your requests, whereas if you had multiple connections, one TCP stream would slow down but the others would not?
3) "[With pipelining] any delays in the processing of anything in the stream (either a long request at the head-of-line or packet loss) will delay the entire stream." -- This implies that packet loss would not delay the entire stream using SPDY. Why not?
The 500ms reference is simply an example. The number could be 50ms or 5s, but the point is still the same: HTTP forces FIFO processing, which results in inefficient use of the underlying TCP connection. As the paper notes, pipelining can help in theory, but in practice pipelining is not used because many intermediaries break when you turn it on. Hence, you're stuck with the worst-case scenario: full RTT + server processing time, and FIFO ordering.
Re: packet loss. Yup, you're exactly right. One of the downsides of using a single connection is that in the case of packet loss, the throughput of the entire connection is cut in half, as opposed to half of one of the N connections in flight. Having said that, there are also some benefits! For example, when you saturate a single connection, you get much faster recovery due to triple duplicate ACKs, plus potentially much wider congestion windows to begin with. Because most HTTP transfers are relatively small (tens of KB), it is not unusual for many connections to terminate even before they exit the slow-start phase!
Re: pipelining. A lost packet would still delay the stream - that's TCP. The win is in eliminating head-of-line blocking at the HTTP level, which enables a lot more and a lot smarter optimization by the browser, followed by some of the wins I described above.
#GroovyDotCom: Here's some hands-on proof of HTTP2's (SPDY's) performance benefits:
http://www.httpvshttps.com/

How to explain this incredibly slow socket connection?

I was trying to set up a bandwidth test between two PCs, with only a switch between them. All network hardware is gigabit. On one machine I put a program to open a socket, listen for connections, accept, followed by a loop to read data and measure bytes received against the 'performance counter'. On the other machine, the program opened a socket, connected to the first machine, and proceeded into a tight loop to pump data into the connection as fast as possible, in 1K blocks per send() call. With just that setup, things seem acceptably fast; I could get about 30 to 40 MBytes/sec through the network - distinctly faster than 100BaseT, within the realm of plausibility for gigabit h/w.
Here's where the fun begins: I tried to use setsockopt() to set the size of the buffers (SO_SNDBUF, SO_RCVBUF) on each end to 1K. Suddenly the receiving end reports it's getting a mere 4,000 or 5,000 bytes a second. Instrumenting the transmit side of things, it appears that the send() calls take 0.2 to 0.3 seconds each, just to send 1K blocks. Removing the setsockopt() from the receive side didn't seem to change things.
Now clearly, trying to manipulate the buffer sizes was a Bad Idea. I had thought that maybe forcing the buffer size to 1K, with send() calls of 1K, would be a way to force the OS to put one packet on the wire per send call, with the understanding that this would prevent the network stack from efficiently combining the data for transmission - but I didn't expect throughput to drop to a measly 4-5K/sec!
I don't have the time or the resources to chase this down and really understand it the way I'd like to, but I would really like to know what could make a send() take 0.2 seconds. Even if it's waiting for acks from the other side, 0.2 seconds is just unbelievable. What gives?
Nagle?
Windows networks with small messages
The explanation is simply that a 1k buffer is an incredibly small buffer size, and your sending machine is probably sending one packet at a time. The sender must wait for the acknowledgement from the receiver before emptying the buffer and accepting the next block to send from your application (because the TCP layer may need to retransmit data later).
A more interesting exercise would be to vary the buffer size from its default for your system (query it to find out what that is) all the way down to 1k and see how each buffer size affects your throughput.
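The suggested sweep could be sketched in Go as follows. This is an illustration, not the original poster's program: sweepSizes and setBuffers are hypothetical helpers, and the 64 KiB starting point is an assumption rather than a queried system default. SetWriteBuffer and SetReadBuffer are the net.TCPConn counterparts of SO_SNDBUF and SO_RCVBUF.

```go
package main

import (
	"fmt"
	"net"
)

// sweepSizes returns buffer sizes from start down to floor, halving each step.
func sweepSizes(start, floor int) []int {
	if floor < 1 {
		floor = 1 // guard against a loop that never reaches the floor
	}
	var sizes []int
	for s := start; s > floor; s /= 2 {
		sizes = append(sizes, s)
	}
	return append(sizes, floor)
}

// setBuffers requests the given send and receive buffer size on a TCP connection.
// The OS may round or clamp the value it actually uses.
func setBuffers(c *net.TCPConn, size int) error {
	if err := c.SetWriteBuffer(size); err != nil {
		return err
	}
	return c.SetReadBuffer(size)
}

func main() {
	// 64 KiB is an assumed starting point, not a queried system default.
	for _, size := range sweepSizes(64*1024, 1024) {
		fmt.Println("would test with buffer size", size)
		// For each size: dial, call setBuffers(conn.(*net.TCPConn), size),
		// then run the same send loop and record the throughput.
	}
}
```

As a sanity check on the numbers in the question: one 1K block per 0.2 to 0.3 seconds works out to roughly 3,400 to 5,100 bytes per second, which matches the reported 4,000 to 5,000.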

Cannot achieve full speed on Symmetrical Internet Connection

We are using a business Ethernet connection (3Mbit upload, 3Mbit download) and trying to understand issues with our tested bandwidth speeds. When uploading a large file we sustain 340 KB/s; downloading we sustain 340 KB/s. However, when we run these transfers simultaneously, the two transfer speeds rise and fall erratically, with an average speed for both of around 250 KB/s. We're using a Hatteras HN404 CPi and we've bypassed the router (plugged a machine directly into the Hatteras; set the NIC to full-duplex).
Is this expected? Should a max upload interfere with a max download on this type of Internet connection?
Are you sure the bottleneck is your connection?
Do you also see this behavior when the simultaneous upload and download are occurring on different systems, or only when one system is handling both the upload and download?
If the problem goes away when independent machines are doing the work, the bottleneck is likely closer to the hard drive.
This sounds expected from my experience with lower end lines. On a home line, I've found that traffic shaping and changing buffer sizes can be a huge help.
TCP/IP without any unusual traffic shaping will favor the most aggressive traffic at the expense of everything else. In your case, this means responses to the outgoing ACKs and such for the download will be delayed or maybe even dropped. See if your HN404 supports class based queuing or something similar and try it out.
Yes it is expected. This is symptomatic of any case in which you have a throttled or capped connection. If you saturate your uplink it will affect your downlink and vice versa.
This is because your connection's rate limiting delays the TCP acknowledgement packets (ACKs) and disrupts the normal "balance" of how these packets flow.
This is very thoroughly described on this page about Cable Modem Troubleshooting Tips, although it is not limited to cable modems:
"If you saturate your cable modem's upload cap with an upload, the ACK packets of your download will have to queue up waiting for a gap between the congested upload data packets. So your ACKs will be delayed getting back to the remote download server, and it will therefore believe you are on a very slow link, and slow down the transmission of further data to you."
So how do you avoid this? The best way is to implement some sort of traffic-shaping or QoS (Quality of Service) on individual sessions to limit them to a maximum throughput based on a percentage of your total available bandwidth.
For example, on my home network I have it set so that no outbound connection can use more than 67% (2/3) of my 192Kbps uplink. That means any single outbound session can only use 128Kbps, which protects my downlink speed by preventing the uplink from becoming saturated.
In most cases you are able to perform this kind of traffic-shaping based on any available criteria such as source ip, destination ip, protocol, port, time of day, etc.
It appears that I was wrong about the simultaneous transfer speeds. The 250KB/s speeds up and down were miscalculated by the transfer program (seemed to have been showing a high average speed). Apparently the Business Ethernet (in this case it is an XO circuit provisioned by Speakeasy) only supports 3Mb total, not up AND down (for 6Mbit total). So if I am transferring up and down at the same time in theory I should only have 1.5Mbit up and down or 187.5KB/s at the maximum (if there was zero overhead).
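The arithmetic in that last paragraph can be checked with a few lines of Go. The helper kbPerSec is hypothetical, and it assumes 1 Mbit = 1000 kbit and ignores protocol overhead, as the paragraph does.

```go
package main

import "fmt"

// kbPerSec converts a line rate in Mbit/s to KB/s
// (1 Mbit = 1000 kbit, 8 bits per byte), ignoring protocol overhead.
func kbPerSec(mbit float64) float64 {
	return mbit * 1000 / 8
}

func main() {
	fmt.Println(kbPerSec(3))   // the whole circuit used in one direction
	fmt.Println(kbPerSec(1.5)) // each direction when the 3 Mbit is shared evenly
}
```

The one-direction ceiling of 375 KB/s is consistent with the 340 KB/s sustained in the single-direction tests once overhead is taken into account.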

How do you rate-limit an IO operation?

Suppose you have a program which reads from a socket. How do you keep the download rate below a certain given threshold?
At the application layer (using a Berkeley socket style API) you just watch the clock, and read or write data at the rate you want to limit at.
If you only read 10kbps on average, but the source is sending more than that, then eventually all the buffers between it and you will fill up. TCP/IP allows for this, and the protocol will arrange for the sender to slow down (at the application layer, probably all you need to know is that at the other end, blocking write calls will block, nonblocking writes will fail, and asynchronous writes won't complete, until you've read enough data to allow it).
At the application layer you can only be approximate - you can't guarantee hard limits such as "no more than 10 kb will pass a given point in the network in any one second". But if you keep track of what you've received, you can get the average right in the long run.
Assuming a TCP/IP-based network transport, packets are sent in response to ACK/NACK packets going the other way.
By limiting the rate of packets acknowledging receipt of the incoming packets, you will in turn reduce the rate at which new packets are sent.
It can be a bit imprecise, so it is probably best to monitor the downstream rate and adjust the response rate adaptively until it falls inside a comfortable threshold. (This will happen really quickly, however, since you send dozens of ACKs a second.)
It is like when limiting a game to a certain number of FPS.
extern int FPS;
....
int timePerFrameInMS = 1000 / FPS;
while (1) {
    long start = getMilliseconds();
    DrawScene();
    long elapsed = getMilliseconds() - start;
    if (elapsed < timePerFrameInMS) {
        sleep(timePerFrameInMS - elapsed); /* sleep() is assumed to take milliseconds */
    }
}
This way you make sure that the game refresh rate will be at most FPS.
In the same manner DrawScene can be the function used to pump bytes into the socket stream.
If you're reading from a socket, you have no control over the bandwidth used - you're reading the operating system's buffer of that socket, and nothing you say will make the person writing to the socket write less data (unless, of course, you've worked out a protocol for that).
All that reading slowly would do is fill up the buffer, and cause an eventual stall on the network end - but you have no control of how or when this happens.
If you really want to read only so much data at a time, you can do something like this:
ReadFixedRate() {
    while (Data_Exists()) {
        t = GetTime();
        ReadBlock();
        while (t + delay > GetTime()) {
            Delay();
        }
    }
}
wget seems to manage it with the --limit-rate option. Here's from the man page:
"Note that Wget implements the limiting by sleeping the appropriate amount of time after a network read that took less time than specified by the rate. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be achieved, so don't be surprised if limiting the rate doesn't work well with very small files."
As others have said, the OS kernel is managing the traffic and you are simply reading a copy of the data out of kernel memory. To roughly limit the rate of just one application, you need to delay your reads of the data and allow incoming packets to buffer up in the kernel, which will eventually slow the acknowledgment of incoming packets and reduce the rate on that one socket.
If you want to slow all traffic to the machine, you need to adjust the sizes of your incoming TCP buffers. In Linux, you would make this change by altering the values in /proc/sys/net/ipv4/tcp_rmem (read memory buffer sizes) and other tcp_* files.
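Before changing anything, it helps to look at the current values. The sketch below reads the file and parses it, assuming the usual three-field layout of tcp_rmem (minimum, default and maximum size in bytes). The function parseTCPRmem is a hypothetical helper, and the file only exists on Linux.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseTCPRmem parses the three whitespace-separated values of tcp_rmem:
// minimum, default and maximum receive buffer size in bytes.
func parseTCPRmem(s string) (lo, def, hi int, err error) {
	f := strings.Fields(s)
	if len(f) != 3 {
		return 0, 0, 0, fmt.Errorf("expected 3 fields, got %d", len(f))
	}
	v := make([]int, 3)
	for i, x := range f {
		if v[i], err = strconv.Atoi(x); err != nil {
			return 0, 0, 0, err
		}
	}
	return v[0], v[1], v[2], nil
}

func main() {
	raw, err := os.ReadFile("/proc/sys/net/ipv4/tcp_rmem") // Linux only
	if err != nil {
		fmt.Println("cannot read tcp_rmem:", err)
		return
	}
	fmt.Println(parseTCPRmem(string(raw)))
}
```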
To add to Branan's answer:
If you voluntarily limit the read speed at the receiver end, eventually queues will fill up at both ends. Then the sender will either block in its send() call or return from the send() call with a sent_length less than the expected length passed to the send() call.
If the sender is not ready to deal with this case by sleeping and trying to resend what did not fit into the OS buffers, you will end up with connection issues (the sender may detect this as an error) or lost data (the sender may unknowingly discard data that did not fit into the OS buffers).
Set small socket send and receive buffers, say 1k or 2k, such that the bandwidth*delay product = the buffer size. You may not be able to get it small enough over fast links.
