How to send a header+buffer efficiently with TCP/IP? - Windows

Suppose that I have two buffers:
A small header (10-20 bytes)
A payload (varies between ~10 KiB and ~10 MiB, but is usually less than 100 KiB)
I need to send this to the server. The server responds immediately once it has received the buffer.
Now, if I use two send calls for this (one for the header and one for the payload), I get very bad latency on Windows: after the sends, recv (the server's response) arrives very slowly (40-100 ms). If I turn on TCP_NODELAY for the socket, the response comes faster, but I'm still not satisfied with the result (5-10 ms, even though the server responds immediately and the ping time to the server is 0.1 ms; it's on a local network).
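For reference, turning on TCP_NODELAY is just a setsockopt call; a minimal sketch, assuming Winsock is already initialized and sock is the connected socket:

    #include <winsock2.h>

    // Disable Nagle's algorithm so small segments are sent immediately
    // instead of being coalesced with later data.
    int disable_nagle(SOCKET sock)
    {
        int flag = 1;
        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                          (const char *)&flag, sizeof(flag));
    }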
I read here that we should avoid write-write-read. But that is exactly what I'm doing now. How should I solve this?
The straightforward solution is to copy the header and payload into a new buffer and send that instead. But I'd like to avoid this, as it means allocating a new, larger buffer (not to mention the copy itself).
Another solution would be to allocate a smallish buffer (like 4 KB) and send the data in chunks. But this way I'm not sure how those 4 KB chunks will be cut into TCP/IP segments. It may happen that the last segment is small, which causes a little inefficiency.
Linux has MSG_MORE, which may solve this issue there, but I see no Windows equivalent.
What is the most efficient (fast) solution to this problem?
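Would gather I/O help here? WSASend accepts an array of WSABUF descriptors, so the header and the payload could in principle go out in a single call without copying them into one combined buffer. A rough sketch of what I mean (blocking, non-overlapped socket assumed; error handling reduced to a return code):

    #include <winsock2.h>

    // Gather-send: one WSASend call, two buffers, no combined copy.
    int send_header_and_payload(SOCKET sock,
                                const char *header, unsigned long header_len,
                                const char *payload, unsigned long payload_len)
    {
        WSABUF bufs[2];
        DWORD sent = 0;

        bufs[0].buf = (char *)header;
        bufs[0].len = header_len;
        bufs[1].buf = (char *)payload;
        bufs[1].len = payload_len;

        // On a blocking socket this returns once both buffers have been handed to the stack.
        if (WSASend(sock, bufs, 2, &sent, 0, NULL, NULL) == SOCKET_ERROR)
            return WSAGetLastError();
        return 0;
    }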

Related

Most efficient way of sending (easily compressible) data over a TCP connection

I developed a TCP server in C/C++ which accepts connections by clients. One of the functionalities is reading arbitrary server memory specified by the client.
Note: Security is no concern here since the client and server applications are run locally only.
Uncompressed memory sending currently works as follows
The client sends the starting address and end address to the server.
The server replies with the memory read between the received start and end addresses, sending it chunk-wise each time the send buffer runs full.
The client reads the expected number of bytes (length = end address - start address).
Sending large chunks of memory that potentially contain a lot of zeroed bytes is slow, so using some sort of compression seems like a good idea. It does, however, make the communication quite a bit more complicated.
Compressed memory sending currently works as follows
The client sends the starting address and end address to the server.
The server reads a chunk of memory and compresses it with zlib. If the compressed memory is smaller than the original memory, it keeps the compressed version. The server stores the chunk size, a flag saying whether it is compressed or not, and the (possibly compressed) bytes in the send buffer. When the buffer is full, it is sent to the client. The send buffer layout is as follows:
Total bytes remaining in the buffer (int) | memory chunks count (int) | list of chunk sizes (int each) | list of whether a chunk is compressed or not (bool each) | list of the data (variable sizes each)
The client reads an int (the total bytes remaining), then reads the rest of the buffer using that length. Next, the client reads the memory chunk count (another int) so it can parse the list of chunk sizes and the list of compressed flags. Using the sizes and the flags, the client walks the list of data and decompresses each chunk where necessary. The raw memory buffer is then assembled from all the decompressed data, and reading from the server continues until the expected number of raw bytes has been assembled.
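To make that layout concrete, the client-side parsing of one buffer looks roughly like this (a simplified sketch: the flags are assumed to be one byte each, and inflate_chunk is a hypothetical helper wrapping the zlib decompression):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical helper wrapping zlib decompression of one chunk.
    std::vector<uint8_t> inflate_chunk(const uint8_t *data, int32_t size);

    // Parses the buffer that follows the leading "total bytes remaining" int:
    //   chunk count (int) | chunk sizes (int each) | compressed flags (1 byte each) | chunk data
    std::vector<uint8_t> parse_buffer(const uint8_t *buf)
    {
        int32_t count;
        std::memcpy(&count, buf, sizeof(count));

        const uint8_t *sizes = buf + sizeof(int32_t);             // chunk size list
        const uint8_t *flags = sizes + count * sizeof(int32_t);   // compressed-or-not list
        const uint8_t *data  = flags + count;                     // the chunk payloads

        std::vector<uint8_t> raw;
        for (int32_t i = 0; i < count; ++i) {
            int32_t size;
            std::memcpy(&size, sizes + i * sizeof(int32_t), sizeof(size));
            if (flags[i]) {                                        // chunk was compressed
                std::vector<uint8_t> chunk = inflate_chunk(data, size);
                raw.insert(raw.end(), chunk.begin(), chunk.end());
            } else {                                               // chunk stored uncompressed
                raw.insert(raw.end(), data, data + size);
            }
            data += size;
        }
        return raw;   // appended by the caller to the raw memory image
    }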
My question is whether this compression approach seems optimal, or whether I'm missing something important. Sending TCP messages is the bottleneck here, so minimizing their number while still transmitting the same data should be the key to optimizing performance.
Hi, I will give you a few starting points. Remember, these are only starting points.
First read this paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.2302&rep=rep1&type=pdf
and this
https://www.sandvine.com/hubfs/downloads/archive/whitepaper-tcp-optimization-opportunities-kpis-and-considerations.pdf
These will give you a hint of what can go wrong, and it is a lot. Basically, my advice is to concentrate on the behavior of the server/network system. What I mean is: get it stress tested and aim for consistent behavior.
If you get congestion in the system, have a strategy for it. Optimize the buffer sizes for the socket. Research how the ring buffers work for the network protocols. Research whether you can use a jumbo MTU. Test whether jitter is a problem in your system. Often, because of some higher power (the OS is busy, or some memory allocation happens), the protocols start behaving erratically.
Now, most importantly: you need to stress test every move you make, all the time. Have a consistent, reproducible test that you can run at any point.
If you are on Linux, setsockopt is your friend and your enemy at the same time. Get to know how it works and what it does.
Define boundaries: what must your server be able to do, and what not.
I wish you the best of luck. I'm optimizing my system for latency and it's tricky to say the least.

Is ZeroMQ slower than boost asio?

I am trying to write a network transfer application.
The data is binary, and most packets are about 800 KB.
The client produces 1,000 packets per second. I want to transfer the data as quickly as possible.
When I use ZeroMQ, the speed reaches 350 packets per second, but boost asio reaches 400 (or more) per second.
As you can see the performance of both methods is not good.
The pattern used for ZeroMQ is PUSH/PULL; the boost asio version is simple synchronous I/O.
Q1: I want to ask, is ZeroMQ only suitable for small messages?
Q2: Is there a way to improve the ZeroMQ speed?
Q3: If ZeroMQ can't, please advise some good method or library to improve this kind of data transfer.
Data Rate
You're attempting to move 800 MByte/second. What sort of connection is this? For a tcp:// transport-class it'd have to be something pretty rapid, e.g. 10 Gbit/s Ethernet, which is still fairly exotic.
So I'm presuming that it's an ipc:// transport-class connection. In that case you can get an improvement by using ZeroMQ's zero-copy functions, which save copying the data repeatedly.
With a normal transfer, you have to copy data into a zmq message, which has to be copied into an ipc pipe, copied out again, and copied back into a new zmq message at the receiving end. All that copying requires 4 x 800 MB = 3.2 GByte/sec of memory bandwidth which, by the time cache conflicts have come into play, is an appreciable percentage of the total memory bandwidth of a typical PC system. Using zero copy should cut that in half.
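A minimal sketch of a zero-copy send with libzmq's zmq_msg_init_data (the PUSH socket and the buffer's ownership rules are simplified assumptions):

    #include <stdlib.h>
    #include <zmq.h>

    // Called by ZeroMQ once it is completely finished with the buffer.
    static void free_buffer(void *data, void *hint)
    {
        (void)hint;
        free(data);
    }

    // Hand `size` bytes at `data` to the socket without copying them into the message.
    // `push` is assumed to be a connected ZMQ_PUSH socket.
    int send_zerocopy(void *push, void *data, size_t size)
    {
        zmq_msg_t msg;
        if (zmq_msg_init_data(&msg, data, size, free_buffer, NULL) != 0)
            return -1;   // message now owns `data`; on failure the caller still does
        return zmq_msg_send(&msg, push, 0);
    }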
Alternative to Zero Copy - Zero Transfer
If you are using ipc://, then consider not sending data through the sockets, but sending references to the data through the sockets.
I have previously blended the use of zmq and a semaphore-locked C++ stl::queue, using zmq simply for its pattern (PUSH/PULL in my case), the stl::queue to carry shared pointers to the data, and leaving the data itself still. The sender locks the queue, puts a shared pointer into it, and then sends a simple message (e.g. "1") through a zmq socket. The recipient reads the "1" and uses that as a cue to lock the queue and pull a shared pointer off it. Thus a shared pointer to data has been transferred from one thread to another in a ZMQ pattern via a stl::queue, but the data itself has stayed still. All I've done is pass ownership of the data between threads. It works so long as the shared pointer that the sender has goes out of scope immediately after sending and is not used by the sender to modify or access the data.
PUSH/PULL is not too bad to deal with - each message goes to only one recipient. It would take more effort to make such a blend with PUB/SUB, and received messages would have to be treated as read-only because each recipient would have a shared pointer to the same block of data as everyone else.
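A rough sketch of that blend (the payload type, names, and socket setup are placeholders; the cue message uses plain libzmq calls):

    #include <memory>
    #include <mutex>
    #include <queue>
    #include <vector>
    #include <zmq.h>

    struct SharedQueue {
        std::mutex lock;
        std::queue<std::shared_ptr<std::vector<char>>> q;  // the data itself never moves
    };

    // Sender thread: park the shared pointer in the queue, then poke the PULL side.
    void send_block(SharedQueue &sh, void *push_socket,
                    std::shared_ptr<std::vector<char>> block)
    {
        {
            std::lock_guard<std::mutex> guard(sh.lock);
            sh.q.push(std::move(block));                   // the sender's reference ends here
        }
        zmq_send(push_socket, "1", 1, 0);                  // the cue message
    }

    // Receiver thread: wait for the cue, then take ownership of the pointer.
    std::shared_ptr<std::vector<char>> recv_block(SharedQueue &sh, void *pull_socket)
    {
        char cue;
        zmq_recv(pull_socket, &cue, 1, 0);                 // blocks until the "1" arrives
        std::lock_guard<std::mutex> guard(sh.lock);
        std::shared_ptr<std::vector<char>> block = sh.q.front();
        sh.q.pop();
        return block;
    }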
Message Size
I've no idea how big a chunk zmqtp transfers at a time, but I'd guess that it's relatively efficient in terms of protocol:data ratio.

How to avoid buffer overflow on asynchronous non-blocking WSASend calls

This is regarding a TCP server being developed on Windows. The server will have thousands of clients connected, and it will continuously keep sending randomly sized buffers (let's say anything between 1 byte and 64 KB) to the clients in a non-blocking, asynchronous manner.
Currently I do not have any constraint or condition before I call WSASend. I just call it with whatever buffer I have, of whatever size, and receive a callback (as it is a non-blocking call) once the data has been sent.
The problem is that if one or a few clients are slow in receiving data, my server's kernel buffers eventually fill up and I start getting WSAENOBUFS errors.
To avoid that, I plan to do the following:
If the server has a kernel buffer of size (X), and the maximum number of clients that can be connected is (N), then I'll allow only (X)/(N) bytes to be outstanding on each client's socket.
(Thus, for 50K connections and a kernel buffer size of 128 MB, I'll write at most 2684 bytes at a time to each socket.) And once I receive the callback, I can send the next set of bytes.
This way, even if one or a few clients are slow, their pending data will not occupy the entire kernel buffer.
Now questions are:
Is this the correct approach?
If yes, then what values of
a. kernel buffer size (X), and
b. maximum number of connections allowed (N)
would be good choices for optimum performance?
Note: This is not a duplicate of my previous question on the same issue. It is more about validating the solution I came up with after going through the answer to that question and the link given in it.
Don't have multiple WSASend() calls outstanding on the same socket. Maintain your own buffer for outgoing data. When you put data in the buffer, if the buffer was previously empty then pass the current buffer content to WSASend(). Each time WSASend() completes, it tells you how many bytes it sent, so remove that many bytes from the front of the buffer, and if the buffer is not empty then call WSASend() again with the remaining content. While WSASend() is busy, if you need to send more data, just append it to the end of your buffer and let WSASend() see it when it is ready.
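A simplified sketch of that buffering scheme (names are illustrative; the overlapped-completion plumbing, error handling, and any locking are omitted):

    #include <winsock2.h>
    #include <vector>

    // Per-connection outgoing-data state; only the buffering logic is shown.
    struct Connection {
        SOCKET sock;
        WSAOVERLAPPED ov{};
        WSABUF wsabuf{};
        std::vector<char> pending;   // bytes queued but not yet confirmed sent
        bool sending = false;        // is a WSASend currently outstanding?
    };

    static void start_send(Connection &c)
    {
        c.wsabuf.buf = c.pending.data();
        c.wsabuf.len = static_cast<ULONG>(c.pending.size());
        c.sending = true;
        // Exactly one outstanding overlapped WSASend per socket.
        WSASend(c.sock, &c.wsabuf, 1, nullptr, 0, &c.ov, nullptr);
    }

    // Called whenever the application wants to send more data.
    void queue_send(Connection &c, const char *data, size_t len)
    {
        c.pending.insert(c.pending.end(), data, data + len);
        if (!c.sending)              // buffer was idle: kick off a send now
            start_send(c);
        // otherwise the data simply waits in `pending` until the completion fires
    }

    // Called when the outstanding WSASend completes having sent `bytes_sent` bytes.
    void on_send_complete(Connection &c, DWORD bytes_sent)
    {
        c.pending.erase(c.pending.begin(), c.pending.begin() + bytes_sent);
        c.sending = false;
        if (!c.pending.empty())      // more data arrived while we were busy
            start_send(c);
    }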
Let the error tell you. That's what it's for. Just keep sending until you get the error, then stop.

How to explain this incredibly slow socket connection?

I was trying to set up a bandwidth test between two PCs, with only a switch between them. All network hardware is gigabit. On one machine I put a program to open a socket, listen for connections, and accept, followed by a loop to read data and measure the bytes received against the 'performance counter'. On the other machine, the program opened a socket, connected to the first machine, and proceeded into a tight loop to pump data into the connection as fast as possible, in 1K blocks per send() call. With just that setup, things seemed acceptably fast; I could get about 30 to 40 MBytes/sec through the network - distinctly faster than 100BaseT, and within the realm of plausibility for gigabit h/w.
Here's where the fun begins: I tried to use setsockopt() to set the size of the buffers (SO_SNDBUF, SO_RCVBUF) on each end to 1K. Suddenly the receiving end reports it's getting a mere 4,000 or 5,000 bytes a second. Instrumenting the transmit side of things, it appears that the send() calls take 0.2 to 0.3 seconds each, just to send 1K blocks. Removing the setsockopt() from the receive side didn't seem to change things.
Now clearly, trying to manipulate the buffer sizes was a Bad Idea. I had thought that maybe forcing the buffer size to 1K, with send() calls of 1K, would be a way to force the OS to put one packet on the wire per send call, with the understanding that this would prevent the network stack from efficiently combining the data for transmission - but I didn't expect throughput to drop to a measly 4-5K/sec!
I don't have the time or the resources to chase this down and really understand it the way I'd like to, but I would really like to know what could make a send() take 0.2 seconds. Even if it's waiting for acks from the other side, 0.2 seconds is just unbelievable. What gives?
Nagle?
Windows networks with small messages
The explanation is simply that a 1k buffer is an incredibly small buffer size, and your sending machine is probably sending one packet at a time. The sender must wait for the acknowledgement from the receiver before emptying the buffer and accepting the next block to send from your application (because the TCP layer may need to retransmit data later).
A more interesting exercise would be to vary the buffer size from its default for your system (query it to find out what that is) all the way down to 1k and see how each buffer size affects your throughput.
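For reference, querying the default and then overriding it is just a getsockopt/setsockopt pair (SO_RCVBUF works the same way); a sketch, with error handling omitted and sock assumed to be the connected socket:

    #include <winsock2.h>

    void tune_sndbuf(SOCKET sock, int new_size)
    {
        int current = 0;
        int len = sizeof(current);
        // Read the system's default send-buffer size first.
        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char *)&current, &len);
        // Then override it, e.g. with the 1K value from the experiment above.
        setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   (const char *)&new_size, sizeof(new_size));
    }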

Ensuring packet order in UDP

I'm using 2 computers with an application to send and receive UDP datagrams. There is no flow control and ICMP is disabled. Frequently, when I send a file as UDP datagrams via the application, two packets swap their order, and the application therefore sees packet loss.
I've disabled any kind of firewall, and there is no hardware switch connected between the computers (they are directly wired).
Is there a way to make sure that Winsock and send() deliver the packets in the same order they were sent?
Or is the OS supposed to do that?
Or is some network device configuration needed?
UDP is a lightweight protocol that by design doesn't handle things like packet sequencing. TCP is a better choice if you want robust packet delivery and sequencing.
UDP is generally designed for applications where packet loss is acceptable or preferable to the delay which TCP incurs when it has to re-request packets. UDP is therefore commonly used for media streaming.
If you're limited to using UDP you would have to develop a method of identifying the out of sequence packets and resequencing them.
UDP does not guarantee that your packets will arrive in order. (It does not even guarantee that your packets will arrive at all.) If you need that level of robustness you are better off with TCP. Alternatively you could add sequence markers to your datagrams and rearrange them at the other end, but why reinvent the wheel?
is there a way to make sure that Winsock and send() deliver the packets in the same order they were sent?
It's called TCP.
Alternatively, try a reliable UDP protocol such as UDT. I'm guessing you might be on a small embedded platform, so you may want a more compact protocol like Bell Labs' RUDP.
there is no flow control (ICMP disabled)
You can implement your own flow control using UDP:
Send one or more UDP packets
Wait for an acknowledgement (sent as another UDP packet from receiver to sender)
Repeat as above
See Sliding window protocol for further details.
[This would be in addition to having a sequence number in the packets which you send.]
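As a concrete illustration of those steps, a very small stop-and-wait version (one datagram in flight, then block for its ACK) could look like this; the packet format and names are made up for illustration, and a real implementation would add a receive timeout and retransmission:

    #include <winsock2.h>
    #include <cstdint>
    #include <cstring>

    // Made-up layout: a 4-byte sequence number followed by the payload.
    // The receiver is assumed to echo the sequence number back as the ACK.
    bool send_with_ack(SOCKET sock, const sockaddr_in &peer,
                       uint32_t seq, const char *payload, int len)
    {
        char packet[1500];
        if (len < 0 || len > static_cast<int>(sizeof(packet) - sizeof(seq)))
            return false;
        std::memcpy(packet, &seq, sizeof(seq));
        std::memcpy(packet + sizeof(seq), payload, len);

        sendto(sock, packet, static_cast<int>(sizeof(seq)) + len, 0,
               reinterpret_cast<const sockaddr *>(&peer), sizeof(peer));

        uint32_t ack = 0;
        // Blocks until something arrives; use select() or SO_RCVTIMEO for a timeout.
        int got = recvfrom(sock, reinterpret_cast<char *>(&ack), sizeof(ack), 0,
                           nullptr, nullptr);
        return got == static_cast<int>(sizeof(ack)) && ack == seq;
    }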
There is no point in trying to create your own TCP-like wrapper. We love the speed of UDP, and that is just going to slow things down. Your problem can be overcome if you design your protocol so that every UDP datagram is independent of the others. Our packets can arrive in any order so long as the header packet arrives first. The header says how many packets are supposed to arrive. Also, UDP has become a lot more reliable since this post was created over a decade ago. Don't try to reinvent TCP on top of it.
This question is 12 years old, and it seems almost a waste to answer it now, even though the suggestions I have were already posed. I dealt with this issue back in 2002, in a program that was using UDP broadcasts to communicate with other running instances on the network. If a packet got lost, it wasn't a big deal. But if I had to send a large packet, greater than 1020 bytes, I broke it up into multiple packets. Each packet contained a header that described which packet number it was, along with a header that told me it was part of a larger overall packet. So the structure was created, the payload was simply dropped into the (correct) place in the buffer, and the bytes were subtracted from the overall total that was needed. I knew all the packets had arrived once the needed byte total reached zero. Once all of the packets arrived, that packet got processed. If another advertisement packet came in, then everything that had been building up was thrown away; that told me that one of the fragments didn't make it. But again, this wasn't critical data; the code could live without it. I did, however, implement an AdvReplyType in every packet, so that if it was a critical packet, I could reply to the sender with an ADVERTISE_INCOMPLETE_REQUEST_RETRY packet type, and the whole process could start over again.
This whole system was designed for LAN operation, and in all of my debugging/beta testing, I rarely ever lost a packet, but on larger networks I would often get them out of order...but I did get them. Being that it's now 12 years later, and UDP broadcasting seems to be frowned upon by a lot of IT Admins, UDP doesn't seem like a good, solid system any longer. ChrisW mentioned a Sliding Window Protocol; this is sort of what I built...without the sliding part! More of a "Fixed Window Protocol". I just wasted a few more bytes in the header of each of the payload packets to tell how many total bytes are in this Overlapped Packet, which packet this was, and the unique MsgID it belonged to so that I didn't have to get the initial packet telling me how many packets to expect. Naturally, I didn't go as far as implementing RFC 1982, as that seemed like overkill for this. As long as I got one packet, I'd know the total length, unique Message Id, and which packet number this one was, making it pretty easy to malloc() a buffer large enough to hold the entire Message. Then, a little math could tell me where exactly in the Message this packet fits into. Once the Message buffer was filled in...I knew I got the whole message. If a packet arrived that didn't belong to this unique Message ID, then we knew this was a bust, and we likely weren't going to ever get the remainder of the old message.
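The per-fragment header described here could be as small as something like the following (the field names are mine, not the original code's, and 1020 is taken from the fragment limit mentioned above):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Every fragment carries the message id, the total payload length, and its own
    // index, so any one fragment is enough to size the buffer and place itself in it.
    struct FragmentHeader {
        uint32_t msg_id;        // unique id of the overall message
        uint32_t total_bytes;   // total payload bytes across all fragments
        uint16_t fragment_no;   // which fragment this is (0-based)
        uint16_t payload_len;   // payload bytes carried by this fragment
    };

    const size_t kMaxFragmentPayload = 1020;   // fragment payload size assumed here

    // Copy one received fragment into the reassembly buffer and count it off.
    // A return value of 0 means the whole message has arrived.
    size_t place_fragment(char *message_buffer, size_t &bytes_missing,
                          const FragmentHeader &h, const char *payload)
    {
        size_t offset = static_cast<size_t>(h.fragment_no) * kMaxFragmentPayload;
        std::memcpy(message_buffer + offset, payload, h.payload_len);
        bytes_missing -= h.payload_len;
        return bytes_missing;
    }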
The only real reason I mention this today is that I believe there is still a time and a place to use a protocol like this: where TCP involves too much overhead, on slow or spotty networks. Those, of course, are also the networks where packet loss is most likely and most feared. So, again, I'd say that "reliability" cannot be a requirement, or you're right back to TCP. If I had to write this code today, I would probably just implement a multicast system, and the whole process would probably be a lot easier on me. Maybe. It has been 12 years, and I've probably forgotten a huge portion of the implementation details.
Sorry, if I woke a sleeping giant here, that wasn't my intention. The original question intrigued me, and reminded me of this turn-of-the-century Windows C++ code I had written. So, please try to keep the negative comments to a minimum--if at all possible! (Positive comments, of course...always welcome!) J/K, of course, folks.

Resources