Average performance measurements for local IPC

We are now assessing different IPC (or rather RPC) methods for our current project, which is in its very early stages. Performance is a big deal, and so we are making some measurements to aid our choice. Our processes that will be communicating will reside on the same machine.
A separate valid option is to avoid IPC altogether (by encapsulating the features of one of the processes in a .NET DLL and having the other one use it), but this is an option we would really like to avoid, as these two pieces of software are developed by two separate companies and we find it very important to maintain good "fences", which make good neighbors.
Our tests consisted of passing messages (which contain variously sized BLOBs) across process boundaries using each method. These are the figures we get (performance range correlates with message size range):
Web Service (SOAP over HTTP):
25-30 MB/s when binary data is encoded as Base64 (default)
70-100 MB/s when MTOM is utilized
.NET Remoting (BinaryFormatter over TCP): 100-115 MB/s
Control group - DLL method call + mem copy: 800-1000 MB/s
Now, we've been looking all over the place for average performance figures for these (and other) IPC methods, including raw TCP loopback sockets, but couldn't find any. Do these figures look sane? Why are these local IPC methods at least 10 times slower than a plain memory copy? I couldn't get better results even with raw sockets; is the overhead of TCP really that big?

Shared memory is the fastest.
A producer process can put its output into memory shared between processes and notify other processes that the shared data has been updated. On Linux you naturally put a mutex and a condition variable in that same shared memory so that other processes can wait for updates on the condition variable.
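A minimal sketch of that pattern on Linux, assuming POSIX shared memory with a process-shared mutex and condition variable; the segment name "/my_ipc_region" and the Payload layout are illustrative only:

    // Producer publishes data in shared memory and signals consumers via a
    // process-shared mutex + condition variable placed in the same region.
    #include <cstring>
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct Payload {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        size_t          size;        // bytes valid in data[]
        unsigned        generation;  // bumped on every update
        char            data[1 << 20];
    };

    int main() {
        int fd = shm_open("/my_ipc_region", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(Payload));
        auto* p = static_cast<Payload*>(
            mmap(nullptr, sizeof(Payload), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        // Mutex/condvar must be marked PROCESS_SHARED to work across processes.
        pthread_mutexattr_t ma; pthread_mutexattr_init(&ma);
        pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&p->mutex, &ma);

        pthread_condattr_t ca; pthread_condattr_init(&ca);
        pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
        pthread_cond_init(&p->cond, &ca);

        // Producer side: write, then notify waiters.
        pthread_mutex_lock(&p->mutex);
        const char msg[] = "new blob contents";
        std::memcpy(p->data, msg, sizeof(msg));
        p->size = sizeof(msg);
        ++p->generation;
        pthread_cond_broadcast(&p->cond);
        pthread_mutex_unlock(&p->mutex);

        munmap(p, sizeof(Payload));
        close(fd);
    }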

Memory-mapped files + synchronization objects are the right way to go (almost the same as shared memory, but with more control). Sockets are way too slow for local communication. It even happens sometimes that network drivers are slower over localhost than over the actual network.
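A minimal Win32 sketch of that combination, assuming a pagefile-backed file mapping plus a named event as the synchronization object; the names "Local\MyIpcRegion" / "Local\MyIpcEvent" and the 1 MB size are illustrative only:

    #include <windows.h>
    #include <cstring>

    int main() {
        HANDLE mapping = CreateFileMappingW(
            INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
            0, 1 << 20, L"Local\\MyIpcRegion");          // backed by the page file
        void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);

        HANDLE dataReady = CreateEventW(nullptr, FALSE, FALSE, L"Local\\MyIpcEvent");

        // Producer: copy the message into the shared view, then signal the consumer.
        const char msg[] = "blob bytes go here";
        std::memcpy(view, msg, sizeof(msg));
        SetEvent(dataReady);

        // The consumer process would open the same names and do:
        //   WaitForSingleObject(dataReady, INFINITE);  then read from its own view.

        UnmapViewOfFile(view);
        CloseHandle(dataReady);
        CloseHandle(mapping);
    }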

Several parts of our system have been redesigned so that we don't have to pass 30MB messages around, but rather 3MB. This allowed us to choose .NET Remoting with BinaryFormatter over named pipes (IpcChannel), which gives satisfactory results.
Our contingency plan (in case we ever do need to pass 30MB messages around) is to pass protobuf-serialized messages over named pipes manually. We have determined that this also provides satisfactory results.

Related

Which is faster for IPC sharing of objects: ZeroMQ or Boost::interprocess?

I am contemplating inter-process sharing of custom objects. My current implementation uses ZeroMQ where the objects are packed into a message and sent from process A to process B.
I am wondering whether it would be faster instead to have a concurrent container implemented using boost::interprocess (where process A will insert into the container and process B will retrieve from it). Not sure if this will be faster than having to serialise the object in process A and then de-serialising it in process B.
Just wondering if anyone has done benchmarking? Is it conceptually right to compare the two?
In principle, ZeroMq should be slower, because the metaphor it's using is the passing of messages. These kinds of libraries are not intended for sharing regions of memory, in place, and for different processes to be able to modify them concurrently.
Specifically, you mentioned "packing". When sharing memory regions, you can - ideally - avoid any packing and just work on data as-is (of course, with the care necessary in concurrent use of the same data structures, using offsets instead of pointers etc.)
Also note that even when sharing is a one-directional back-and-forth (i.e. only one process at a time accesses any of the data), ZeroMq can only match the use of IPC shared memory if it supports zero-copying all the way down. This is not clear to me from the FAQ page on zero-copying (but may be the case anyway).
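A minimal sketch of working on shared data in place with Boost.Interprocess, no packing or serialisation involved; the segment name "MySegment" and the Record vector are illustrative only:

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>

    namespace bip = boost::interprocess;

    // The allocator stores offsets instead of raw pointers, so the container is
    // valid even if the segment maps at different addresses in each process.
    using ShmAlloc  = bip::allocator<double, bip::managed_shared_memory::segment_manager>;
    using ShmVector = bip::vector<double, ShmAlloc>;

    int main() {
        // Process A: create the segment and build the object in place.
        bip::managed_shared_memory segment(bip::create_only, "MySegment", 1 << 20);
        ShmVector* v = segment.construct<ShmVector>("Record")(segment.get_segment_manager());
        v->push_back(42.0);

        // Process B would instead do:
        //   bip::managed_shared_memory segment(bip::open_only, "MySegment");
        //   ShmVector* v = segment.find<ShmVector>("Record").first;
        // and read *v directly, with a mutex/condition guarding concurrent access.

        bip::shared_memory_object::remove("MySegment");
    }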
I agree with Nim, they're too different for easy comparison.
ZeroMQ has inproc which uses shared memory as a byte transport.
Boost.Interprocess seems to be mostly about having objects constructed in shared memory, accessible to multiple processes/threads. It does have message queues, but they too are just byte transports that require objects to be serialised, just like ZeroMQ. They're not object containers, so they're more comparable to ZeroMQ, but are quite a long way from what Boost.Interprocess mostly represents.
I have done a ZeroMQ / STL container hybrid. Yeurk. I used a C++ STL queue to store objects, but then used a ZeroMQ PUSH/PULL socket to govern which thread could read from that queue. Reading threads were blocked on a ZeroMQ poll, and when they received a message they'd lock the queue and read an object out from it. This avoided having to serialise objects, which was handy, so it was pretty fast. This doesn't work for PUB/SUB which implies copying objects between recipients, which would need object serialisation.
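A minimal sketch of that hybrid, assuming the C zmq API: objects stay in an in-process STL queue, and an inproc PUSH/PULL pair is used purely as a wake-up signal so readers know something is available. The endpoint name "inproc://queue_signal" is illustrative only.

    #include <zmq.h>
    #include <mutex>
    #include <queue>
    #include <string>

    std::queue<std::string> g_items;   // the objects themselves are never serialised
    std::mutex              g_lock;

    int main() {
        void* ctx  = zmq_ctx_new();
        void* push = zmq_socket(ctx, ZMQ_PUSH);
        void* pull = zmq_socket(ctx, ZMQ_PULL);
        zmq_bind(pull, "inproc://queue_signal");   // inproc: bind before connect
        zmq_connect(push, "inproc://queue_signal");

        // Producer: enqueue the object, then send a 1-byte token as the signal.
        { std::lock_guard<std::mutex> g(g_lock); g_items.push("payload"); }
        zmq_send(push, "x", 1, 0);

        // Consumer: block until a token arrives, then pop the real object.
        char token;
        zmq_recv(pull, &token, 1, 0);
        std::string item;
        { std::lock_guard<std::mutex> g(g_lock); item = g_items.front(); g_items.pop(); }

        zmq_close(push); zmq_close(pull); zmq_ctx_destroy(ctx);
    }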
ZMQ IPC is effective only on Linux (it uses a UNIX domain socket).
Its performance is slower than boost::interprocess shared memory.

What dbus performance issue could prevent it from being used in an embedded system?

From my reading, dbus should be about twice as slow as other messaging IPC mechanisms, due to the existence of a daemon.
In the discussion of the SO question "Which Linux IPC technique to use?", some people mention performance issues. Do you see performance issues other than the factor-of-two slowdown? Do you see any issue that would prevent dbus from being used in an embedded system?
To my understanding, dbus is intended for small messages. If a large amount of data needs to be passed around, one solution is to put the data into shared memory or a file, and then use dbus to notify. The other IPC mechanisms under consideration, according to the SO discussion, are: Signals, Anonymous Pipes, Named Pipes or FIFOs, SysV Message Queues, POSIX Message Queues, SysV Shared Memory, POSIX Shared Memory, SysV Semaphores, POSIX Semaphores, FUTEX locks, File-backed and anonymous shared memory using mmap, UNIX Domain Sockets, Netlink Sockets, Network Sockets, Inotify mechanisms, FUSE subsystem, D-Bus subsystem.
I should mention another SO question which lists the requirements (though it is Apache-centred):
packet/message oriented
ability to handle both point-to-point and one-to-many communication
no hierarchy, there's no server and client
if one endpoint crashes, the others must be notified
good support from existing Linux distros
existence of a "binding" for Apache, for the purpose of creating dynamic pages (this is too specific though, and can be ignored in a general discussion of embedded dbus usage)
Yet another SO question about performance mentions techniques to improve it. With all this taken care of, I guess there should be fewer issues or drawbacks when dbus is used in an embedded system.
I don't think there is any real-and-big performance issue.
Did some profiling:
On an ARM926EJ-S 200 MHz processor, a method call and reply with two uint32 arguments takes anywhere between 0 and 15 ms, averaging 6 ms.
Changing the 2nd parameter to an array of 1000 bytes and packing/unpacking it with the iteration API takes about 18 ms.
With the same 1000-byte array, packing/unpacking it with the fixed-length API takes about 8 ms (a sketch contrasting the two APIs follows this list).
As a comparison, using a SysV message queue to pass a message to another process and get a reply also takes about 10 ms, though without optimising the code or repeating the test over a large number of samples.
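A minimal sketch of the two libdbus packing paths compared above: element-by-element appends via the iterator API versus a single call via the fixed-array API. The helper names are illustrative only.

    #include <dbus/dbus.h>

    void pack_with_iterator(DBusMessage* msg, const unsigned char* buf, int len) {
        DBusMessageIter iter, array;
        dbus_message_iter_init_append(msg, &iter);
        dbus_message_iter_open_container(&iter, DBUS_TYPE_ARRAY, "y", &array);
        for (int i = 0; i < len; ++i)                       // one append per byte
            dbus_message_iter_append_basic(&array, DBUS_TYPE_BYTE, &buf[i]);
        dbus_message_iter_close_container(&iter, &array);
    }

    void pack_fixed(DBusMessage* msg, const unsigned char* buf, int len) {
        DBusMessageIter iter, array;
        dbus_message_iter_init_append(msg, &iter);
        dbus_message_iter_open_container(&iter, DBUS_TYPE_ARRAY, "y", &array);
        // Single call copies the whole block; this is the faster path measured above.
        dbus_message_iter_append_fixed_array(&array, DBUS_TYPE_BYTE, &buf, len);
        dbus_message_iter_close_container(&iter, &array);
    }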
In summary, the profiling does not show a performance issue.
To support this conclusion, there is a performance-related page in the dbus documentation, which mentions only the double context switch: with dbus, a message has to go to the daemon first and then on to the destination.
Edit: if you send messages directly, bypassing the daemon, the performance should roughly double.
Well, the GENIVI Alliance, targeting the automotive industry, implemented and supports CommonAPI, which works on top of D-Bus, as the IPC mechanism for cars' head units.

How do I rival the speed of xcopy?

Is there an open-source project or best-practices guide that shows the fastest way to copy files around a local machine, LAN, SAN, and WAN, one that can rival the speed of the built-in xcopy of Windows 7 (or 8) or the Windows Explorer copy?
To be blunt, not all file I/O is created equal. There are different overheads in certain protocols and techniques. Some libraries don't take advantage of asynchronous operations or of the line speed of the hardware.
I'm taking inventory of the large data transfers we use and trying to rate the effectiveness of our client applications and the applications from external vendors. Certain server applications are the worst offenders (java-based being the worst of the worst).
I'm limiting the scope of this research to SMB 2 and 3 (CIFS on Windows 7 and 8).
Is there a disadvantage in speed to using the POSIX libraries (fopen, fread, fseek, etc.)?
Is there any advantage to using Win32 calls (CopyFile2, ReadFileEx)?
xcopy actually is not the fastest way to copy files, especially across disks or across a local network. There's a commercial product called TeraCopy that is much faster. It's closed-source so I don't know entirely how it works but one of the main differences is that instead of using a single loop to read a chunk of data to a memory buffer and then write that buffer to the new location, it uses two threads and a producer/consumer queue.
The producer reads chunks of the source file and puts them into a queue. The consumer reads from the queue and writes to the target. The advantage is that reading and writing can happen concurrently. You do need to be careful, though: have the producer keep an eye on the queue size so it doesn't grow large enough to use up too much memory. Usually reading will be faster than writing, but that also depends on the source and destination locations.
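A minimal sketch of that producer/consumer copy: one thread reads chunks from the source, another writes them to the destination, with a bounded queue in between so reading cannot run arbitrarily far ahead. The chunk size and queue depth are illustrative only.

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 3) return 1;          // usage: copy <src> <dst>
        const size_t kChunk = 1 << 20;   // 1 MB reads/writes
        const size_t kMaxQueued = 8;     // cap buffered data at ~8 MB

        std::ifstream src(argv[1], std::ios::binary);
        std::ofstream dst(argv[2], std::ios::binary);

        std::queue<std::vector<char>> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        std::thread reader([&] {
            std::vector<char> buf(kChunk);
            while (src.read(buf.data(), buf.size()) || src.gcount() > 0) {
                buf.resize(static_cast<size_t>(src.gcount()));
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return q.size() < kMaxQueued; });   // back-pressure
                q.push(std::move(buf));
                cv.notify_all();
                buf.assign(kChunk, 0);
            }
            std::lock_guard<std::mutex> lk(m);
            done = true;
            cv.notify_all();
        });

        // Consumer: drain the queue and write to the target while reads continue.
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty() && done) break;
            std::vector<char> chunk = std::move(q.front());
            q.pop();
            cv.notify_all();
            lk.unlock();
            dst.write(chunk.data(), chunk.size());
        }
        reader.join();
    }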

Efficient Overlapped I/O for a socket server

Which of these two models would be more efficient (consider thrashing, utilization of the processor cache, overall design, everything, etc.)?
1 IOCP and spinning up X threads (where X is the number of processors the computer has). This would mean that my "server" would have only 1 IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have 1 listener that would also be associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
X IOCPs (where X is the number of processors the computer has), each with 1 thread. This would mean that each processor has its own queue and 1 thread to serve/handle it. With this model I would have a separate listener (not using IOCP) that would handle incoming connections and assign the SOCKET to the proper IOCP (one of the X that were created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
One line with several cashiers to handle the transactions. Each person waits in the same line and each cashier takes the next available person in line.
Each cashier has their own line and the people are "placed" into one of those lines
Between these two designs, which one is more efficient? In each model the overlapped I/O structures would be allocated with VirtualAlloc with MEM_COMMIT (as opposed to "new"), so the swap file should not be an issue (no paging). Based on how it has been described to me, with VirtualAlloc and MEM_COMMIT the memory is reserved and is not paged out. This would allow the SOCKETs to write the incoming data right into my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
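A minimal sketch of that 'standard' arrangement (option 1): one completion port shared by all sockets and one worker thread per CPU calling GetQueuedCompletionStatus. Error handling is omitted for brevity.

    #include <winsock2.h>
    #include <windows.h>
    #include <thread>
    #include <vector>

    int main() {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);

        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const DWORD threads = si.dwNumberOfProcessors;

        // One IOCP for everything; the last argument caps concurrently running threads.
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, threads);

        std::vector<std::thread> workers;
        for (DWORD i = 0; i < threads; ++i) {
            workers.emplace_back([iocp] {
                DWORD bytes;
                ULONG_PTR key;
                OVERLAPPED* ov;
                // The kernel wakes whichever worker it decides is best and lets
                // another run if this one blocks.
                while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
                    // handle the completed WSARecv/WSASend for the socket in 'key'
                }
            });
        }

        // Each accepted socket is associated with the same port:
        //   CreateIoCompletionPort((HANDLE)acceptedSocket, iocp, (ULONG_PTR)perSocketCtx, 0);

        for (auto& t : workers) t.join();
        WSACleanup();
    }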
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down; it's actually more a case of either having 1 queue with X bank tellers where you join the queue and know that you'll be seen in an efficient order as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts and the one next to you contains a whole bunch of people who only want to do some paying in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page-sized buffer can be allocated on a page boundary and so take only one page rather than happening to span two because the memory manager used a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that your server fails due to ENOBUFS and you're sure that's caused by the I/O locked page limit and not a lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
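A minimal sketch of the VirtualAlloc approach discussed above, once profiling shows you actually need it: the allocation comes back on an allocation-granularity (typically 64 KB) boundary, so a page-sized I/O buffer sits on a page boundary and locks only one page.

    #include <windows.h>

    int main() {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        // dwAllocationGranularity is the unit VirtualAlloc reservations round up to.
        SIZE_T size = si.dwAllocationGranularity;     // e.g. 64 KB

        void* buf = VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        // ... use buf as an OVERLAPPED WSARecv/ReadFile buffer ...
        VirtualFree(buf, 0, MEM_RELEASE);
    }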
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version that you suggest could be useful on some NUMA architectures where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor, and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load-balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of using IOCP first, and only once you know that cross-NUMA-node issues are actually affecting your performance move towards the more complex architecture...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it'll be associated to a single thread even with the single IOCP approach.

Why does multithreaded file transfer improve performance?

RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of its main features, highlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that this speeds up the copying process.
My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...
The tool is making use of improvements in hardware which can optimise multiple read and write requests much better.
When copying one file at a time, the hardware isn't going to know that the block of data currently passing under the read head (or nearby) will be needed for a subsequent read, since the software hasn't queued that request yet.
A single file copy these days is not a very taxing task for modern disk subsystems. By giving these hardware systems more work to do at once, the tool is leveraging their improved optimising features.
A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.
This means that an individual file CANNOT be copied in less time than the network latency allows, even if it is empty (0 bytes). Because the copy probably involves several file-server calls (open, write, close), this may be several times the latency.
To efficiently copy files, you want to have a server and client which use a sane protocol which has pipelining; that's to say - the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.
Of course to do that would require a custom server not a SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single threaded.
So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
A single-threaded implementation which used a sensible protocol would be best in my opinion.
It's a network tool, so the bottleneck is the network, not the HDD. Up to a (low) point you can get more throughput out of a TCP link by using a few connections in parallel. This (a) parallelizes the TCP handshakes; (b) can make better use of the bandwidth-delay product if that is high; and (c) doesn't make one arbitrarily slow connection the critical path if for some reason it encounters a high RTT or failure rate.
Another way to do (b) is to use an enormous TCP socket receive buffer but that's not always convenient.
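A minimal sketch of that alternative, assuming a POSIX socket: enlarging SO_RCVBUF so a single connection can keep more data in flight over a high bandwidth-delay-product path. The 4 MB figure is illustrative only.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main() {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        int rcvbuf = 4 * 1024 * 1024;                 // request a 4 MB receive buffer
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
        // The kernel may clamp this (e.g. to net.core.rmem_max on Linux), which is
        // part of why it is "not always convenient".
        close(s);
    }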
Several of the other answers about HDD are incorrect. Practically any HDD will do some read-ahead on the assumption of sequential access, and any intelligent OS cache will also do that.
My guess is that the HDD read/write heads spend most of their time idle, waiting for the correct block of the disk to appear under them; the more data being copied at once, the less time spent idle, and most modern disk schedulers should take care of the jumping (for a low number of files/fragments).
As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.
I think those assumptions are overly simplistic.
First, while LANs run at 100 Mbit / 1 Gbit, long-haul networks have a maximum data rate that is capped by the slowest link along the path.
Second, the effective throughput of a TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have an 8+ Mbit link, but my data rate on downloads is rarely above 1-2 Mbit per second when I'm downloading from the USA. So if you run multiple streams in parallel, one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off and lower overall transfer rates.)
Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.
Over long distances, networks can write much faster than they can read. With multithreading, having additional "readers" means the data can be transmitted more efficiently and not bogged down in buffers.