Most efficient way of sending (easily compressible) data over a TCP connection - performance

I developed a TCP server in C/C++ which accepts connections from clients. One of its functionalities is reading arbitrary server memory at addresses specified by the client.
Note: Security is no concern here since the client and server applications are run locally only.
Uncompressed memory sending currently works as follows:
The client sends the starting address and end address to the server.
The server replies with the memory read between the received start and end addresses, sending it chunk-wise each time the send buffer runs full.
The client reads the expected amount of bytes (length = end address - starting address)
Sending large chunks of memory that potentially contain long runs of zero bytes is slow, so some form of compression seems like a good idea. However, it makes the communication quite a bit more complicated.
Compressed memory sending currently works as follows:
The client sends the starting address and end address to the server.
The server reads a chunk of memory and compresses it with zlib. If the compressed data is smaller than the original, it keeps the compressed version. The server stores the chunk size, a flag indicating whether the chunk is compressed, and the (possibly compressed) bytes in the send buffer. When the buffer is full, it is sent back to the client. The send buffer layout is as follows:
Total bytes remaining in the buffer (int) | memory chunks count (int) | list of chunk sizes (int each) | list of whether a chunk is compressed or not (bool each) | list of the data (variable sizes each)
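For illustration, here is a minimal sketch (not the poster's actual code) of the per-chunk compression decision using zlib's compressBound() and compress(); the Chunk struct and the makeChunk name are made up for this example:

#include <zlib.h>
#include <vector>

struct Chunk {
    std::vector<unsigned char> data; // stored bytes (compressed or raw)
    bool compressed;                 // true if data holds zlib-compressed bytes
};

// Compress one chunk of server memory; keep the original if compression does not help.
Chunk makeChunk(const unsigned char* mem, uLong len) {
    Chunk c;
    uLongf destLen = compressBound(len);          // worst-case compressed size
    std::vector<unsigned char> tmp(destLen);
    if (compress(tmp.data(), &destLen, mem, len) == Z_OK && destLen < len) {
        tmp.resize(destLen);
        c.data = std::move(tmp);
        c.compressed = true;
    } else {
        c.data.assign(mem, mem + len);            // compression did not shrink it
        c.compressed = false;
    }
    return c;
}

The chunk's stored size and its compressed flag are then appended to the size list and flag list of the send buffer described above.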
The client reads an int (the total bytes remaining), then reads that many remaining bytes from the socket. Next it reads the memory chunks count (another int) so it can parse the list of chunk sizes and the list of compression flags. Using the sizes and the flags, the client can then walk the list of data and decompress where necessary. The raw memory buffer is assembled from all the decompressed received data, and reading from the server continues until the expected number of raw bytes has been assembled.
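A sketch of the client-side parsing, assuming the whole frame has already been read into buf and that both sides agree on a fixed maximum raw chunk size (that constant, like the parseFrame name, is an assumption and not stated in the question):

#include <zlib.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends the raw (decompressed) chunk payloads to out.
void parseFrame(const unsigned char* buf, std::vector<unsigned char>& out) {
    std::size_t off = 0;
    auto get = [&](void* dst, std::size_t n) { std::memcpy(dst, buf + off, n); off += n; };

    int32_t remaining = 0, count = 0;
    get(&remaining, sizeof(remaining));            // total bytes remaining in the buffer
    get(&count, sizeof(count));                    // memory chunks count

    std::vector<int32_t> sizes(count);             // stored (possibly compressed) size of each chunk
    for (int32_t i = 0; i < count; ++i) get(&sizes[i], sizeof(int32_t));

    std::vector<unsigned char> isCompressed(count);
    for (int32_t i = 0; i < count; ++i) get(&isCompressed[i], 1);

    for (int32_t i = 0; i < count; ++i) {
        if (isCompressed[i]) {
            uLongf rawLen = 64 * 1024;             // assumed fixed maximum raw chunk size
            std::vector<unsigned char> raw(rawLen);
            if (uncompress(raw.data(), &rawLen, buf + off, sizes[i]) == Z_OK)
                out.insert(out.end(), raw.begin(), raw.begin() + rawLen);
        } else {
            out.insert(out.end(), buf + off, buf + off + sizes[i]);
        }
        off += sizes[i];
    }
}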
My question is whether this compression approach appears optimal or whether I'm missing something important. Sending TCP messages is the bottleneck here, so minimizing their number while still transmitting the same data should be the key to better performance.

Hi, I will give you a few starting points. Remember, these are only starting points.
First read this paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.2302&rep=rep1&type=pdf
and this
https://www.sandvine.com/hubfs/downloads/archive/whitepaper-tcp-optimization-opportunities-kpis-and-considerations.pdf
These will give you a hint of what can go wrong, and it is a lot. Basically, my advice is to concentrate on the behavior of the server/network system as a whole: stress test it and try to get consistent behavior.
If you get congestion in the system, have a strategy for it. Optimize the socket buffer sizes. Research how the ring buffers of the network stack work, and whether you can use a jumbo MTU. Test whether jitter is a problem in your system; protocols often start behaving erratically because of some outside factor (the OS is busy, or a memory allocation stalls).
Most importantly, stress test every change you make. Have a consistent, reproducible test that you can run at any point.
If you are on Linux, setsockopt is your friend and enemy at the same time. Get to know how it works and what it does.
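For example (an illustration, not something from the answer), setting the socket buffer sizes and disabling Nagle on Linux for a connected socket fd looks like this:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void tuneSocket(int fd) {
    int sndbuf = 1 << 20;   // request a 1 MiB send buffer (the kernel may clamp or double it)
    int rcvbuf = 1 << 20;   // request a 1 MiB receive buffer
    int nodelay = 1;        // push small writes out immediately instead of coalescing them

    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));

    // Check what the kernel actually granted (Linux typically doubles the requested value).
    socklen_t len = sizeof(sndbuf);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
}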
Define boundaries for what your server must be able to do and what it must not.
I wish you the best of luck. I'm optimizing my system for latency and it's tricky to say the least.

Related

Measuring client back-up when using Boost.Beast WebSocket

I am reading from a Boost.Beast WebSocket. When my application gets backed up, the websocket sender appears happy to delay/buffer the data on their end (presumably at the application level, as they will delay by 1 minute or more).
What is the best way to measure if I am getting backed up? For example, can I look at the size of a TCP buffer? I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread (in which case, backup can be measured by the size of the queue). But I'm wondering if there's a more direct way.
This varies by platform, but there's the SO_RCVBUF option that sets the amount of data that can be queued onto the socket before TCP pauses receiving more data.
If you have access to the socket, s, invoke this to inspect how much data its receive buffer can hold:
net::socket_base::receive_buffer_size opt = {};
s.get_option(opt);
You'll probably see that it defaults to something like 64K or so.
Then crank it up real high to like a megabyte:
net::socket_base::receive_buffer_size optSet(1000000);
boost::system::error_code ec;
s.set_option(optSet, ec);
YMMV on how large a value you can pass to the set_option call and how much it actually helps.
Keep in mind, this is only a temporary measure to relieve the pressure. If you keep getting backed up, you'll only hit the limit again, just a bit later and perhaps less often.
I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread
Yes, but then you've basically implemented exactly what SO_RCVBUF does. Either that, or you buffer to infinity, with unbounded memory cost.

How to send a header+buffer efficiently with TCP/IP?

Suppose, that I have two buffers:
A small header (10-20 bytes)
Payload (varies between ~10KiB - ~10MiB, but usually less than 100KiB)
I need to send this to the server. The server responds immediately once it has received the buffer.
Now, if I use two send calls for this (one for the header and one for the payload), I get very bad latency on Windows: after the sends, the recv (server response) arrives very slowly (40-100 ms). If I turn on TCP_NODELAY for the socket, the response comes faster, but I'm still not satisfied with the result (5-10 ms, even though the server responds immediately and the ping time to it is 0.1 ms - it's on a local network).
I read here that we should avoid write-write-read, but this is exactly what I'm doing now. How should I solve this?
The straightforward solution is to copy the header and payload into a new buffer, and send that instead. But I'd like to avoid this, as I need to allocate a new larger buffer for this (not to mention the copy).
Another solution would be to allocate a smallish buffer (like 4 KB) and send the data in chunks. But this way, I'm not sure how the 4 KB chunks will be cut into TCP/IP segments; the last segment may end up small, which causes a little inefficiency.
Linux has MSG_MORE, which may solve this issue there, but I see no Windows equivalent.
What is the most efficient (fast) solution to this problem?
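One common approach (not mentioned in the question) is gather I/O: hand the header and the payload to the kernel in a single call, so they leave together with no extra copy and no write-write-read pattern. On Windows that is WSASend() with two WSABUF entries; on Linux, the equivalent is writev(). A minimal sketch:

#include <winsock2.h>

// Send a header and a payload in one call; the kernel gathers both buffers.
int sendHeaderAndPayload(SOCKET s, char* header, ULONG headerLen,
                         char* payload, ULONG payloadLen) {
    WSABUF bufs[2];
    bufs[0].buf = header;  bufs[0].len = headerLen;
    bufs[1].buf = payload; bufs[1].len = payloadLen;

    DWORD sent = 0;
    // Blocking gather send (no OVERLAPPED, no completion routine).
    return WSASend(s, bufs, 2, &sent, 0, nullptr, nullptr);
}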

How to avoid buffer overflow on asynchronous non-blocking WSASend calls

This is regarding a TCP server being developed on Windows. The server will have thousands of clients connected, and it will continuously send randomly sized chunks of bytes (say, anything between 1 byte and 64 KB) to clients in a non-blocking, asynchronous manner.
Currently I do not apply any constraint or condition before I call WSASend. I just call it with whatever buffer I have, of whatever size, and receive a callback (as it is a non-blocking call) once the data is sent.
The problem is that if one or a few clients are slow in receiving data, my server's kernel buffers eventually fill up and I start getting buffer overflow (WSAENOBUFS) errors.
To avoid that, I plan to do like this:
If the server's kernel buffer is (X) bytes and the maximum number of connected clients is (N), then I'll allow only (X)/(N) bytes to be outstanding on each client's socket.
(Thus, for 50K connections and a 128 MB kernel buffer, I'd write at most 2684 bytes at a time to each socket.) Once I receive the callback, I can send the next set of bytes.
This way, even if one or a few clients are slow, their pending data will not occupy the entire kernel buffer.
Now questions are:
Is this a correct approach?
If yes, what values of
a. the kernel buffer size (X), and
b. the maximum number of connections allowed (N)
would be good to go with for optimum performance?
Note: This is not a duplicate of my previous question on the same issue; it is about validating the solution I came up with after going through its answer and the link provided there.
Don't have multiple WSASend() calls outstanding on the same socket. Maintain your own buffer for outgoing data. When you put data in the buffer, if the buffer was previously empty then pass the current buffer content to WSASend(). Each time WSASend() completes, it tells you how many bytes it sent, so remove that many bytes from the front of the buffer, and if the buffer is not empty then call WSASend() again with the remaining content. While WSASend() is busy, if you need to send more data, just append it to the end of your buffer and let WSASend() see it when it is ready.
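A minimal sketch of this single-outstanding-send pattern (the names and the way the completion is delivered are illustrative; it assumes some IOCP or completion-routine mechanism calls onSendComplete with the transferred byte count):

#include <winsock2.h>
#include <cstring>
#include <vector>

struct Connection {
    SOCKET sock;
    std::vector<char> inFlight;  // buffer currently handed to WSASend (must stay untouched)
    std::vector<char> queued;    // bytes that arrived while a send was in progress
    bool sendInFlight = false;
    WSAOVERLAPPED ov{};
    WSABUF wsabuf{};
};

// Start a WSASend for everything currently in the in-flight buffer.
static void startSend(Connection& c) {
    c.wsabuf.buf = c.inFlight.data();
    c.wsabuf.len = static_cast<ULONG>(c.inFlight.size());
    DWORD sent = 0;
    std::memset(&c.ov, 0, sizeof(c.ov));
    int rc = WSASend(c.sock, &c.wsabuf, 1, &sent, 0, &c.ov, nullptr);
    if (rc == 0 || WSAGetLastError() == WSA_IO_PENDING)
        c.sendInFlight = true;       // completion will be reported later
    // else: a real error; close the connection, etc.
}

// Called by the application whenever it has data for this client.
void queueData(Connection& c, const char* data, std::size_t len) {
    if (c.sendInFlight) {
        c.queued.insert(c.queued.end(), data, data + len);   // never touch the in-flight buffer
    } else {
        c.inFlight.insert(c.inFlight.end(), data, data + len);
        startSend(c);
    }
}

// Called from the completion handling with the number of bytes WSASend transferred.
void onSendComplete(Connection& c, DWORD bytesSent) {
    c.sendInFlight = false;
    c.inFlight.erase(c.inFlight.begin(), c.inFlight.begin() + bytesSent);
    c.inFlight.insert(c.inFlight.end(), c.queued.begin(), c.queued.end());
    c.queued.clear();
    if (!c.inFlight.empty())
        startSend(c);                // send whatever accumulated in the meantime
}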
Let the error tell you. That's what it's for. Just keep sending until you get the error, then stop.

How to explain this incredibly slow socket connection?

I was trying to set up a bandwidth test between two PCs, with only a switch between them. All network hardware is gigabit. On one machine I put a program to open a socket, listen for connections, accept, followed by a loop to read data and measure bytes received against the 'performance counter'. On the other machine, the program opened a socket, connected to the first machine, and proceeded into a tight loop to pump data into the connection as fast as possible, in 1K blocks per send() call. With just that setup, things seem acceptably fast; I could get about 30 to 40 MBytes/sec through the network - distinctly faster than 100BaseT, within the realm of plausibility for gigabit hardware.
Here's where the fun begins: I tried to use setsockopt() to set the size of the buffers (SO_SNDBUF, SO_RCVBUF) on each end to 1K. Suddenly the receiving end reports it's getting a mere 4,000 or 5,000 bytes a second. Instrumenting the transmit side of things, it appears that the send() calls take 0.2 to 0.3 seconds each, just to send 1K blocks. Removing the setsockopt() from the receive side didn't seem to change things.
Now clearly, trying to manipulate the buffer sizes was a Bad Idea. I had thought that maybe forcing the buffer size to 1K, with send() calls of 1K, would be a way to force the OS to put one packet on the wire per send call, with the understanding that this would prevent the network stack from efficiently combining the data for transmission - but I didn't expect throughput to drop to a measly 4-5K/sec!
I don't have the time or the resources to chase this down and really understand it the way I'd like to, but I would really like to know what could make a send() take 0.2 seconds. Even if it's waiting for ACKs from the other side, 0.2 seconds is just unbelievable. What gives?
Nagle?
Windows networks with small messages
The explanation is simply that a 1k buffer is an incredibly small buffer size, and your sending machine is probably sending one packet at a time. The sender must wait for the acknowledgement from the receiver before emptying the buffer and accepting the next block to send from your application (because the TCP layer may need to retransmit data later).
A more interesting exercise would be to vary the buffer size from its default for your system (query it to find out what that is) all the way down to 1k and see how each buffer size affects your throughput.
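A tiny sketch of that experiment (the names are illustrative, and it assumes a BSD-style socket API): query the default first, then sweep downwards and measure throughput at each size.

#include <sys/socket.h>

int querySendBuffer(int fd) {
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);   // the system default if never changed
    return sndbuf;
}

void sweepBufferSizes(int fd) {
    for (int size = querySendBuffer(fd); size >= 1024; size /= 2) {
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
        // ... run the 1K-block send loop and record MB/s for this buffer size ...
    }
}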

How do you rate-limit an IO operation?

Suppose you have a program which reads from a socket. How do you keep the download rate below a certain given threshold?
At the application layer (using a Berkeley socket style API) you just watch the clock, and read or write data at the rate you want to limit at.
If you only read 10kbps on average, but the source is sending more than that, then eventually all the buffers between it and you will fill up. TCP/IP allows for this, and the protocol will arrange for the sender to slow down (at the application layer, probably all you need to know is that at the other end, blocking write calls will block, nonblocking writes will fail, and asynchronous writes won't complete, until you've read enough data to allow it).
At the application layer you can only be approximate - you can't guarantee hard limits such as "no more than 10 kb will pass a given point in the network in any one second". But if you keep track of what you've received, you can get the average right in the long run.
Assuming a network transport, a TCP/IP-based one: packets are sent in response to ACK/NACK packets going the other way.
By limiting the rate of packets acknowledging receipt of the incoming packets, you will in turn reduce the rate at which new packets are sent.
It can be a bit imprecise, so it's probably best to monitor the downstream rate and adjust the response rate adaptively until it falls inside a comfortable threshold. (This will happen really quickly, however; you send dozens of ACKs a second.)
It is like limiting a game to a certain number of FPS.
extern int FPS;
extern long getMilliseconds(void);
extern void DrawScene(void);
extern void sleepMilliseconds(long ms);   /* e.g. usleep(ms * 1000) or Sleep(ms) */

void renderLoop(void) {
    long timePerFrameInMS = 1000 / FPS;
    while (1) {
        long start = getMilliseconds();
        DrawScene();
        long elapsed = getMilliseconds() - start;
        if (elapsed < timePerFrameInMS)
            sleepMilliseconds(timePerFrameInMS - elapsed);   /* wait out the rest of the frame */
    }
}
This way you make sure that the game refresh rate will be at most FPS.
In the same manner DrawScene can be the function used to pump bytes into the socket stream.
If you're reading from a socket, you have no control over the bandwidth used - you're reading the operating system's buffer of that socket, and nothing you say will make the person writing to the socket write less data (unless, of course, you've worked out a protocol for that).
All that reading slowly would do is fill up the buffer, and cause an eventual stall on the network end - but you have no control of how or when this happens.
If you really want to read only so much data at a time, you can do something like this:
ReadFixedRate() {
    while (Data_Exists()) {
        t = GetTime();
        ReadBlock();
        while (t + delay > GetTime()) {
            Delay();        /* wait until the full interval has elapsed before the next read */
        }
    }
}
wget seems to manage it with the --limit-rate option. Here's what the man page says:
Note that Wget implements the limiting by sleeping the appropriate amount of time after a network read that took less time than specified by the rate. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be achieved, so don't be surprised if limiting the rate doesn't work well with very small files.
As other have said, the OS kernel is managing the traffic and you are simply reading a copy of the data out of kernel memory. To roughly limit the rate of just one application, you need to delay your reads of the data and allow incoming packets to buffer up in the kernel, which will eventually slow the acknowledgment of incoming packets and reduce the rate on that one socket.
If you want to slow all traffic to the machine, you need to adjust the sizes of your incoming TCP buffers. On Linux, you would effect this change by altering the values in /proc/sys/net/ipv4/tcp_rmem (read memory buffer sizes) and the other tcp_* files.
To add to Branan's answer:
If you voluntarily limit the read speed at the receiver end, eventually queues will fill up at both ends. Then the sender will either block in its send() call or return from send() with a sent_length smaller than the length it passed in.
If the sender is not prepared to deal with this case by sleeping and retrying to send what did not fit into the OS buffers, you will end up with connection issues (the sender may detect this as an error) or lost data (the sender may unknowingly discard data that did not fit into the OS buffers).
Set small socket send and receive buffers, say 1k or 2k, such that the bandwidth*delay product = the buffer size. You may not be able to get it small enough over fast links.
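A rough worked example (the numbers are illustrative): to hold a connection to about 10 KB/s over a path with a 200 ms round-trip time, bandwidth times delay is 10,000 bytes/s x 0.2 s = 2,000 bytes, so a buffer of roughly 2 KB is the right order of magnitude. Over a fast local link with, say, 0.1 ms RTT, the same 10 KB/s target would need a buffer of only about 1 byte, well below the minimum an OS will typically allow, which is the "fast links" caveat above.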
