Is ZeroMQ slower than boost asio? - performance

I am trying to write a network transfer application.
The data is binary data and each packet size is mostly 800KB.
The client produces 1000 data per second. I want transfer data as quick as possible.
When I use ZeroMQ, the speed hits 350 data per second, but the boost asio hits 400(or more) per second.
As you can see the performance of both methods is not good.
The pattern used for ZeroMQ is a PUSH/PULL pattern, the boost asio is simple sync I/O.
Q1: I want to ask, is ZeroMQ only suitable for small messages?
Q2: Is there a way to improve the ZeroMQ speed?
Q3: If ZeroMQ can't, please advice some good method or library to improve these kind of data transfer.

Data Rate
You're attempting to move 800 MByte/second. What sort of connection is this? For a tcp:// transport-class it'd have to something pretty rapid, e.g. 100 Gbit/s Ethernet, which is pretty exotic.
So I'm presuming that it's an ipc:// transport-class connection. In which case you can get an improvement, using ZeroMQ zerocopy functions, which saves copying the data repeatedly.
With a normal transfer, you have to copy data into a zmq message, that has to be copied into an ipc pipe, copied out again, and copied back into a new zmq message at the receiving end. All that copying requires 4 x 800 = 2.4 GByte/sec memory bandwidth which, by the time cache conflicts have come into play, is an appreciable percentage of the total memory bandwidth of a typical PC system. Using zerocopy should cut that in half.
Alternative to Zero Copy - Zero Transfer
If you are using ipc://, then consider not sending data through the sockets, but sending references to the data through the sockets.
I have previously blended use of zmq and a semaphore locked C++ stl::queue, using zmq simply for it's pattern ( PUSH/PULL in my case ), the stl::queue to carry shared pointers to data, and leave the data still. The sender locks the queue, puts a shared pointer into it, and then sends a simple message ( e.g. "1" ) through a zmq socket. The recipient reads the "1" and uses that as a cue to lock the queue and pull a shared pointer off it. Thus a shared pointer to data has been transferred from one thread to another in a ZMQ pattern via a stl::queue, but the data itself has stayed still. All I've done is pass ownership of the data between threads. It works so long as the shared pointer that the send has goes out of scope immediately after sending and is not used by the sender to modify or access the data.
PUSH/PULL is not too bad to deal with - each message goes to only one recipient. It would take more effort to make such a blend with PUB/SUB, and received messages would have to be treated as read-only because each recipient would have a shared pointer to the same block of data as everyone else.
Message Size
I've not idea how big a chunk zmqtp transfers at a time, but I'd guess that it's relatively efficient in terms of protocol:data ratio.

Related

Using NdisFIndicateReceiveNetBufferLists for every packet vs chaining them all together to receive?

I have an NDIS driver where i send received packets to the user service, then the service marks those packets that are OK (not malicious), then i iterate over the packets that are good to receive then i send them one by one by by converting each of them back to a proper NetBufferList with one NetBuffer and then i indicate them using NdisFIndicateReceiveNetBufferLists.
This caused a problem that in large file transfers through SMB (copying files from shares), which reduced the transfer speed significantly.
As a workaround, i now chain all of the NBLs that are OK altogether (instead of sending them one by one), and then send all of them at once via NdisFIndicateReceiveNetBufferLists.
My question is, will this change cause any issue? Any difference between sending X number of NBLs one by one vs chaining them together and sending all of them at once? (since most of them might be related to different flows/apps)
Also, the benefit of chaining packets together is much greater in multi packet receive compared to multi packet send via FilterSendNetBufferLists, why is that?
An NET_BUFFER represents one single network frame. (With some appropriate hand-waving for LSO/RSC.)
An NET_BUFFER_LIST is a collection of related NET_BUFFERs. Each NET_BUFFER on the same NET_BUFFER_LIST belong to the same "traffic flow" (more on that later), they have all the same metadata and will have all the same offloads performed on them. So we use the NET_BUFFER_LIST to group related packets and to have them share metadata.
The datapath generally operates on batches of multiple NET_BUFFER_LISTs. The entire batch is only grouped together for performance reasons; there's not a lot of implied relation between multiple NBLs within a batch. Exception: most datapath routines take a Flags parameter that can hold flags that make some claims about all the NBLs in a batch, for example, NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE.
So to summarize, you can indeed safely group multiple NET_BUFFER_LISTs into a single indication, and this is particularly important for perf. You can group unrelated NBLs together, if you like. However, if you are combining batches of NBLs, make sure you clear out any NDIS_XXX_FLAGS_SINGLE_XXX style flags. (Unless, of course, you know that the flags' promise still holds. For example, if you're combining 2 batches of NBLs that both had the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag, and if you verify that the first NBL in each batch has the same EtherType, then it is actually safe to preserve the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag.)
However note that you generally cannot combine multiple NET_BUFFERs into the same NET_BUFFER_LIST, unless you control the application that generated the payload and you know that the NET_BUFFERs' payloads belong to the same traffic flow. The exact semantics of a traffic flow are a little fuzzy down in the NDIS layer, but you can imagine it means that any NDIS-level hardware offload can safely treat each packet as the same. For example, an IP checksum offload needs to know that each packet has the same pseudo-header. If all the packets belong to the same TCP or UDP socket, then they can be treated as the same flow.
Also, the benefit of chaining packets together is much greater in multi packet receive compared to multi packet send via FilterSendNetBufferLists, why is that?
Receive is the expensive path, for two reasons. First, the OS has to spend CPU to demux the raw stream of packets coming in from the network. The network could send us packets from any random socket, or packets that don't match any socket at all, and the OS has to be prepared for any possibility. Secondly, the receive path handles untrusted data, so it has to be cautious about parsing.
In comparison, the send path is super cheap: the packets just fall down to the miniport driver, who sets up a DMA and they're blasted to hardware. Nobody in the send path really cares what's actually in the packet (the firewall already ran before NDIS saw the packets, so you don't see that cost; and if the miniport is doing any offload, that's paid on the hardware's built-in processor, so it doesn't show up on any CPU you can see in Task Manager.)
So if you take a batch of 100 packets and break it into 100 calls of 1 packet on the receive path, the OS has to grind through 100 calls of some expensive parsing functions. Meanwhile, 100 calls through the send path isn't great, but it'll be only a fraction of the CPU costs of the receive path.

Measuring client back-up when using Boost.Beast WebSocket

I am reading from a Boost.Beast WebSocket. When my application gets backed up, the websocket sender appears happy to delay/buffer the data on their end (presumably at the application level, as they will delay by 1 minute or more).
What is the best way to measure if I am getting backed up? For example, can I look at the size of a TCP buffer? I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread (in which case, backup can be measured by the size of the queue). But I'm wondering if there's a more direct way.
This varies by platform, but there's the SO_RCVBUF option that sets the amount of data that can be queued onto the socket before TCP pauses receiving more data.
If you have access to the socket, s, invoke this to inspect how much data its rcv buffer size can hold
net::socket_base::receive_buffer_size opt = {};
s.get_option(opt);
You'll probably see that it defaults to something like 64K or so.
Then crank it up real high to like a megabyte:
net::socket_base::receive_buffer_size optSet(1000000);
boost::system::error_code ec;
s.set_option(optSet, ec);
YMMV on how large of a value you can pass to the set_option call and how much actually helps.
Keep in mind, this is only a temporary measure to relieve the pressure. If you keep getting backed up, you'll only hit the limit again, just a bit later and perhaps less often.
I could also read all the data into memory in a fast thread, and put it in a queue for the slow thread
Yes, but you've basically implemented exactly what SO_RCVFROM does. Either that, or you buffer to infinity with respect to memory cost (no limit).

Which is faster for IPC sharing of objects -ZeroMQ or Boost::interprocess?

I am contemplating inter-process sharing of custom objects. My current implementation uses ZeroMQ where the objects are packed into a message and sent from process A to process B.
I am wondering whether it would be faster instead to have a concurrent container implemented using boost::interprocess (where process A will insert into the container and process B will retrieve from it). Not sure if this will be faster than having to serialise the object in process A and then de-serialising it in process B.
Just wondering if anyone has done benchmarking? Is it conceptually right to compare the two?
In principle, ZeroMq should be slower, because the metaphor it's using is the passing of messages. These kinds of libraries are not intended for sharing regions of memory, in place, and for different processes to be able to modify them concurrently.
Specifically, you mentioned "packing". When sharing memory regions, you can - ideally - avoid any packing and just work on data as-is (of course, with the care necessary in concurrent use of the same data structures, using offsets instead of pointers etc.)
Also note that even when sharing is a one-directional back-and-forth (i.e. only one process at a time accesses any of the data), ZeroMq can only match the use of IPC shared memory if it supports zero-copying all the way down. This is not clear to me from the FAQ page on zero-copying (but may be the case anyway).
I agree with Nim, they're too different for easy comparison.
ZeroMQ has inproc which uses shared memory as a byte transport.
Boost.Interprocess seems to be mostly about having objects constructed in shared memory, accessible to multiple processes / threads. However it does have message queues, but they too are just byte transports requiring objects to be serialised, just like you have to with ZeroMQ. They're not object containers, so are more comparable to ZeroMQ but is quite a long way from what Boost.Interprocess seems to represent.
I have done a ZeroMQ / STL container hybrid. Yeurk. I used a C++ STL queue to store objects, but then used a ZeroMQ PUSH/PULL socket to govern which thread could read from that queue. Reading threads were blocked on a ZeroMQ poll, and when they received a message they'd lock the queue and read an object out from it. This avoided having to serialise objects, which was handy, so it was pretty fast. This doesn't work for PUB/SUB which implies copying objects between recipients, which would need object serialisation.
ZMQ IPC is effective only in linux(using UNIX domain socket)
The performance is slower than boost::interprocess shared_memory

Efficient Overlapped I/O for a socket server

Which of these two different models would be more efficient (consider thrashing, utilization of processor cache, overall desgn, everything, etc)?
1 IOCP and spinning up X threads (where X is the number of processors the computer has). This would mean that my "server" would only have 1 IOCP (queue) for all requests and X Threads to serve/handle them. I have read many articles discussing the effeciency of this design. With this model I would have 1 listener that would also be associated to the IOCP. Lets assume that I could figure out how to keep the packets/requests synchronized.
X IOCP (where X is the number of processors the computer has) and each IOCP has 1 thread. This would mean that each Processor has its own queue and 1 thread to serve/handle them. With this model I would have a separate Listener (not using IOCP) that would handle incomming connections and would assign the SOCKET to the proper IOCP (one of the X that were created). Lets assume that I could figure out the Load Balancing.
Using an overly simplified analogy for the two designs (a bank):
One line with several cashiers to hand the transactions. Each person is in the same line and each cashier takes the next available person in line.
Each cashier has their own line and the people are "placed" into one of those lines
Between these two designs, which one is more efficient. In each model the Overlapped I/O structures would be using VirtualAlloc with MEM_COMMIT (as opposed to "new") so the swap-file should not be an issue (no paging). Based on how it has been described to me, using VirtualAlloc with MEM_COMMIT, the memory is reserved and is not paged out. This would allow the SOCKETS to write the incomming data right to my buffers without going through intermediate layers. So I don't think thrashing should be a factor but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down; it's actually more a case of either having 1 queue with X bank tellers where you join the queue and know that you'll be seen in an efficient order as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts and the one next to you contains a whole bunch of people who only want to do some paying in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and wont be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page sized buffer can be allocated on a page boundary and so only take one page rather than happening to span two due to the memory manager deciding to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version ... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that you server fails due to ENOBUFS and you're sure that's caused by the I/O locked page limit and not lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version that you suggest could be useful in some NUMA architectures where you use NIC teaming to have the switch spit your traffic across multiple NICs, bind each NIC to a different physical processor and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of using IOCP first and only once you know that cross NUMA node issues are actually affecting your performance move towards the more complex architecture...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it'll be associated to a single thread even with the single IOCP approach.

How do you rate-limit an IO operation?

Suppose you have a program which reads from a socket. How do you keep the download rate below a certain given threshold?
At the application layer (using a Berkeley socket style API) you just watch the clock, and read or write data at the rate you want to limit at.
If you only read 10kbps on average, but the source is sending more than that, then eventually all the buffers between it and you will fill up. TCP/IP allows for this, and the protocol will arrange for the sender to slow down (at the application layer, probably all you need to know is that at the other end, blocking write calls will block, nonblocking writes will fail, and asynchronous writes won't complete, until you've read enough data to allow it).
At the application layer you can only be approximate - you can't guarantee hard limits such as "no more than 10 kb will pass a given point in the network in any one second". But if you keep track of what you've received, you can get the average right in the long run.
Assuming a network transport, a TCP/IP based one, Packets are sent in response to ACK/NACK packets going the other way.
By limiting the rate of packets acknowledging receipt of the incoming packets, you will in turn reduce the rate at which new packets are sent.
It can be a bit imprecise, so its possibly optimal to monitor the downstream rate and adjust the response rate adaptively untill it falls inside a comfortable threshold. ( This will happen really quick however, you send dosens of acks a second )
It is like when limiting a game to a certain number of FPS.
extern int FPS;
....
timePerFrameinMS = 1000/FPS;
while(1) {
time = getMilliseconds();
DrawScene();
time = getMilliseconds()-time;
if (time < timePerFrameinMS) {
sleep(timePerFrameinMS - time);
}
}
This way you make sure that the game refresh rate will be at most FPS.
In the same manner DrawScene can be the function used to pump bytes into the socket stream.
If you're reading from a socket, you have no control over the bandwidth used - you're reading the operating system's buffer of that socket, and nothing you say will make the person writing to the socket write less data (unless, of course, you've worked out a protocol for that).
All that reading slowly would do is fill up the buffer, and cause an eventual stall on the network end - but you have no control of how or when this happens.
If you really want to read only so much data at a time, you can do something like this:
ReadFixedRate() {
while(Data_Exists()) {
t = GetTime();
ReadBlock();
while(t + delay > GetTime()) {
Delay()'
}
}
}
wget seems to manage it with the --limit-rate option. Here's from the man page:
Note that Wget implements the limiting
by sleeping the appropriate amount of
time after a network read that took
less time than specified by the
rate. Eventually this strategy causes
the TCP transfer to slow down to
approximately the specified rate.
However, it may take some time for
this balance to be achieved, so don't
be surprised if limiting the rate
doesn't work well with very small
files.
As other have said, the OS kernel is managing the traffic and you are simply reading a copy of the data out of kernel memory. To roughly limit the rate of just one application, you need to delay your reads of the data and allow incoming packets to buffer up in the kernel, which will eventually slow the acknowledgment of incoming packets and reduce the rate on that one socket.
If you want to slow all traffic to the machine, you need to go and adjust the sizes of your incoming TCP buffers. In Linux, you would affect this change by altering the values in /proc/sys/net/ipv4/tcp_rmem (read memory buffer sizes) and other tcp_* files.
To add to Branan's answer:
If you voluntarily limit the read speed at the receiver end, eventually queues will fill up at both end. Then the sender will either block in its send() call or return from the send() call with a sent_length less than the expected length passed on to the send() call.
If the sender is not ready to deal with this case by sleeping and trying to resend what has not fit into OS buffers, you will ending up have connection issues (the sender may detect this as an error) or losing data (the sender may unknowingly discard data the did not fit into OS buffers).
Set small socket send and receive buffers, say 1k or 2k, such that the bandwidth*delay product = the buffer size. You may not be able to get it small enough over fast links.

Resources