what dbus performance issue could prevent it from embedded system? - performance

From my reading dbus performance should be twice slower than other messaging ipc mechanisms due to existence of a daemon.
In the discussion of the so question which Linux IPC technique to use someones mention performance issues. Do you see performance issues other than the twice slower factor? Do you see the issue that prevent dbus from being used in embedded system?
To my understanding if dbus is intended for small messages. If large amount of data need to be passed around, one of the solution is to put the data into shared memory or a pile, and then use dbus to notify. Other ipc mechanisms according to the so discussion being in consideration are: Signals, Anonymous Pipes, Named Pipes or FIFOs, SysV Message Queues, POSIX Message Queues, SysV Shared memory, POSIX Shared memory, SysV semaphores, POSIX semaphores, FUTEX locks, File-backed and anonymous shared memory using mmap, UNIX Domain Sockets, Netlink Sockets, Network Sockets, Inotify mechanisms, FUSE subsystem, D-Bus subsystem.
I should mention another so question which lists the requirements (though it is apache centered):
packet/message oriented
ability to handle both point-to-point and one-to-many communication
no hierarchy, there's no server and client
if one endpoint crashes, the others must be notified
good support from existing Linux distros
existence of a "bind" for Apache, for the purpose of creating dynamic pages -- this is too specific though, it can be ignored in a general embedded dbus usage discussion
Yet another so question about performance mentions techniques to improve the performance. With all this being taken care of I guess there should be less issue or drawback when dbus is used in an embedded system.

I don't think there is any real-and-big performance issue.
Did some profiling:
On an arm926ejs 200MHz processor, a method call and reply with two uint32 arguments consumes anywhere between 0 to 15 ms. average 6 ms.
Changed the 2nd parameter to an array of 1000 bytes. If use the iteration api to pack and unpack the 2nd parameter, it takes about 18 ms.
The same 2nd parameter of an array of 1000 bytes. If use the fixed-length api to pack and unpack the 2nd parameter, it takes about 8 ms.
As a comparison, use the SysV msgq passing a message to another process and getting a reply. It is about 10 ms too, though without optimizing the code and repeating the test for a large number of samples.
In summary, the profiling does not show a performance issue.
To support this conclusion, there is a performance related page on dbus page, which specifies only the double-context-switching because with dbus it needs to pass the message to the daemon then to the destination.
Edit: If you send messages directly bypassing the daemon, the performance would double.

Well, the Genivi alliance, targeting the automotive industry, implemented and supports CommonAPI, which works on top of DBUS, as IPC mechanism for cars' head-units.

Related

Which is faster for IPC sharing of objects -ZeroMQ or Boost::interprocess?

I am contemplating inter-process sharing of custom objects. My current implementation uses ZeroMQ where the objects are packed into a message and sent from process A to process B.
I am wondering whether it would be faster instead to have a concurrent container implemented using boost::interprocess (where process A will insert into the container and process B will retrieve from it). Not sure if this will be faster than having to serialise the object in process A and then de-serialising it in process B.
Just wondering if anyone has done benchmarking? Is it conceptually right to compare the two?
In principle, ZeroMq should be slower, because the metaphor it's using is the passing of messages. These kinds of libraries are not intended for sharing regions of memory, in place, and for different processes to be able to modify them concurrently.
Specifically, you mentioned "packing". When sharing memory regions, you can - ideally - avoid any packing and just work on data as-is (of course, with the care necessary in concurrent use of the same data structures, using offsets instead of pointers etc.)
Also note that even when sharing is a one-directional back-and-forth (i.e. only one process at a time accesses any of the data), ZeroMq can only match the use of IPC shared memory if it supports zero-copying all the way down. This is not clear to me from the FAQ page on zero-copying (but may be the case anyway).
I agree with Nim, they're too different for easy comparison.
ZeroMQ has inproc which uses shared memory as a byte transport.
Boost.Interprocess seems to be mostly about having objects constructed in shared memory, accessible to multiple processes / threads. However it does have message queues, but they too are just byte transports requiring objects to be serialised, just like you have to with ZeroMQ. They're not object containers, so are more comparable to ZeroMQ but is quite a long way from what Boost.Interprocess seems to represent.
I have done a ZeroMQ / STL container hybrid. Yeurk. I used a C++ STL queue to store objects, but then used a ZeroMQ PUSH/PULL socket to govern which thread could read from that queue. Reading threads were blocked on a ZeroMQ poll, and when they received a message they'd lock the queue and read an object out from it. This avoided having to serialise objects, which was handy, so it was pretty fast. This doesn't work for PUB/SUB which implies copying objects between recipients, which would need object serialisation.
ZMQ IPC is effective only in linux(using UNIX domain socket)
The performance is slower than boost::interprocess shared_memory

Using ZMQ for bidirectional inter-thread communication

I am new to ZeroMQ. I have spent the last couple of months reading the documentation and experimenting with the library. I am currently developing a multi-threaded c++ application and want to use ZeroMQ instead of mutexes to exchange data between my main thread and one of its child.
The child thread is handling the communication with an external application. Therefore, I will need to queue/sockets between the main thread and its child. One for outgoing messages and one for incoming messages.
Which zmq socket should I use in order to achieve this.
Thanks in advance
By moving from using shared memory and mutexes to using ZeroMQ, you are entering the realm of Actor model programming.
This, in my opinion, is a fairly good thing. However, there are some things to be aware of.
The only reason mutexes are no longer needed is because you are copying data, not sharing it. The 'cost' is that copying a lot of data takes a lot longer than locking a mutex that points to shared data. So you can end up with a nice looking Actor model program that runs like a dog in comparison to an equivalent program that uses shared memory / mutexes.
A caveat is that on complicated architectures like Intel Xeons with multiple CPUs, accessing shared memory can, conceivably, take just as long as copying it. This is because this may (depending on how lucky you've been) mean transactions across the QPI bus. Actor model programming is ideal for NUMA hardware architectures. Modern Intel and AMD architectures are, partially/fundamentally, NUMA, but the protocols they run over QPI / Hypertransport "fake" an SMP environment.
I would avoid ZMQ_PAIR sockets wherever practicable. They don't work across network connections. This means that if, for any reason, your application needs to scale across multiple computers you have to re-write your code. However, if you use different socket types from the very beginning, a scale-up of your application is nothing more than a matter of redeploying your code, not changing it. FYI nanomsg PAIRs do not have this restriction.
Don't for one moment assume that Actor model programming is going to solve all your problems. It brings in a whole suite of problems all of it's own. You can still deadlock, livelock, spinlock, etc. The problem with Actor model programmes is that these problems can be lurking in your code for years and never happen, until one day the network is just a little bit busier and -bam- your program stops running...
However, there is a development of Actor model programming called "Communicating Sequential Processes". This doesn't solve those problems, but if you've written your program with these problems they are guaranteed to happen every single time. So you discover the problem during development and testing, not five years later. There's also a process calculi for it, i.e. you can algebraically prove that your design is problem free before you ever write a single line of code. ZeroMQ is not CSP. Interestingly CSP is making something of a comeback - the Rust and Go languages both do CSP. However, they do not do CSP across network connections - it's all in-process stuff. Erlang does CSP too, and AFAIK does it across network connections.
Assuming you've read all that about CSP and are still going to use ZeroMQ, think carefully about what it is you are planning on sending across the ZeroMQ sockets. If it's all within one program on the same machine, then sending copies of, for example, arrays of integers is fine. They'll still be interpretable as integers at the receiving end. However, if you have aspirations to send data through ZMQ sockets to another computer it's well worth considering some sort of serialisation technology. ZeroMQ delivers messages. Why not make those messages the byte stream from an object serialiser? Then you can guarantee that the received message will, after de-serialisation, mean something appropriate at the receiving end, instead of having to solve problems with endianness, etc.
Favourite serialisers for me include Google Protocol Buffers. It is language / operating system agnostic, giving lots of options for a heterogeneous system. ASN.1 is another really good option, it can be got for most of the important languages, and it has a rich set of wire formats (including XML and, now/soon, JSON, which gives some interesting inter-op options), and does Constraints (something Google PBufs don't do), but does tend to cost money if one wants really good tools for it. XML can be understood by almost anything, but is bloated. Basically it's worth picking one that doesn't tie you down to using, say, C#, or Python everywhere.
Good luck!

what does it mean configuring MPI for shared memory?

I have a bit of research related question.
Currently I have finished implementation of structure skeleton frame work based on MPI (specifically using openmpi 6.3). the frame work is supposed to be used on single machine.
now, I am comparing it with other previous skeleton implementations (such as scandium, fast-flow, ..)
One thing I have noticed is that the performance of my implementation is not as good as the other implementations.
I think this is because, my implementation is based on MPI (thus a two sided communication that require the match of send and receive operation)
while the other implementations I am comparing with are based on shared memory. (... but still I have no good explanation to reason out that, and it is part of my question)
There are some big difference on completion time of the two categories.
Today I am also introduced to configuration of open-mpi for shared memory here => openmpi-sm
and there come comes my question.
1st what does it means to configure MPI for shared memory? I mean while MPI processes live in their own virtual memory; what really is the flag like in the following command do?
(I thought in MPI every communication is by explicitly passing a message, no memory is shared between processes).
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
2nd why is the performance of MPI is so much worse with compared to other skeleton implementation developed for shared memory? At least I am also running it on one single multi-core machine.
(I suppose it is because other implementation used thread parallel programming, but I have no convincing explanation for that).
any suggestion or further discussion is very welcome.
Please let me know if I have to further clarify my question.
thank you for your time!
Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components, that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If both processes reside on different nodes, Open MPI walks the available network interfaces and choses the fastest one that can connect to the other node. It puts some preferences on fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback if the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block have all processes operate on it directly. If data is stored in the shared memory, it could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory from remote sockets on the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it pinn each MPI process to a separate CPU core.

Average performance measurements for local IPC

We are now assessing different IPC (or rather RPC) methods for our current project, which is in its very early stages. Performance is a big deal, and so we are making some measurements to aid our choice. Our processes that will be communicating will reside on the same machine.
A separate valid option is to avoid IPC altogether (by encapsulating the features of one of the processes in a .NET DLL and having the other one use it), but this is an option we would really like to avoid, as these two pieces of software are developed by two separate companies and we find it very important to maintain good "fences", which make good neighbors.
Our tests consisted of passing messages (which contain variously sized BLOBs) across process boundaries using each method. These are the figures we get (performance range correlates with message size range):
Web Service (SOAP over HTTP):
25-30 MB/s when binary data is encoded as Base64 (default)
70-100 MB/s when MTOM is utilized
.NET Remoting (BinaryFormatter over TCP): 100-115 MB/s
Control group - DLL method call + mem copy: 800-1000 MB/s
Now, we've been looking all over the place for some average performance figures for these (and other) IPC methods, including performance of raw TCP loopback sockets, but couldn't find any. Do these figures look sane? Why is the performance of these local IPC methods at least 10 times slower than copying memory? I couldn't get better results even when I used raw sockets - is the overhead of TCP that big?
Shared memory is the fastest.
A producer process can put its output into memory shared between processes and notify other processes that the shared data has been updated. On Linux you naturally put a mutex and a condition variable in that same shared memory so that other processes can wait for updates on the condition variable.
Memory-mapped files + synchronization objects is the right way to go (almost the same as shared memory, but with more control). Sockets are way too slow for local communications. Especially it sometimes happens that network drivers are slower with localhost, than over network.
Several parts of our system have been redesigned so that we don't have to pass 30MB messages around, but rather 3MB. This allowed us to choose .NET Remoting with BinaryFormatter over named pipes (IpcChannel), which gives satisfactory results.
Our contingency plan (in case we ever do need to pass 30MB messages around) is to pass protobuf-serialized messages over named pipes manually. We have determined that this also provides satisfactory results.

Win32 Overlapped I/O - Completion routines or WaitForMultipleObjects?

I'm wondering which approach is faster and why ?
While writing a Win32 server I have read a lot about the Completion Ports and the Overlapped I/O, but I have not read anything to suggest which set of API's yields the best results in the server.
Should I use completion routines, or should I use the WaitForMultipleObjects API and why ?
You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event and you can wait for that event to be signalled to indicate the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, this doesn't scale well as you're limited to the number of handles that you can pass to WaitForMultipleObjects().
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various EX versions of the Wait functions, etc). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said IMHO.
There's a third way, before issuing the call you should associate the handle that you want to do overlapped I/O on with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() and looping. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. Link to a reference to that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside; IMHO, you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre as it deals will all you need to know about overlapped I/O and most other advanced windows platform techniques and APIs.
WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However completion ports, and the event based programming model, are a more difficult concept to really work against.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server2008 made a change with completion ports that the originating thread is not now needed to complete IO operations, this may make a bigger difference (see this article by Mark Russinovich).
Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.
The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO() however does a round-robin on its objects, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc). This is obviously undesireable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. I was able on a 512mb W2K machine to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets, if they're idle - a busy socket consumes more non-paged pool and that's the ultimate limit on how many sockets you can have).
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.
Not sure. But I use WaitForMultipleObjects and/or WaitFoSingleObjects. It's very convenient.
Either routine works and I don't really think one is any significant faster then another.
These two approaches exists to satisfy different programming models.
WaitForMultipleObjects is there to facilitate async completion pattern (like UNIX select() function) while completion ports is more towards event driven model.
I personally think WaitForMultipleObjects() approach result in cleaner code and more thread safe.

Resources