Using ZMQ for bidirectional inter-thread communication - zeromq

I am new to ZeroMQ. I have spent the last couple of months reading the documentation and experimenting with the library. I am currently developing a multi-threaded c++ application and want to use ZeroMQ instead of mutexes to exchange data between my main thread and one of its child.
The child thread is handling the communication with an external application. Therefore, I will need to queue/sockets between the main thread and its child. One for outgoing messages and one for incoming messages.
Which zmq socket should I use in order to achieve this.
Thanks in advance

By moving from using shared memory and mutexes to using ZeroMQ, you are entering the realm of Actor model programming.
This, in my opinion, is a fairly good thing. However, there are some things to be aware of.
The only reason mutexes are no longer needed is because you are copying data, not sharing it. The 'cost' is that copying a lot of data takes a lot longer than locking a mutex that points to shared data. So you can end up with a nice looking Actor model program that runs like a dog in comparison to an equivalent program that uses shared memory / mutexes.
A caveat is that on complicated architectures like Intel Xeons with multiple CPUs, accessing shared memory can, conceivably, take just as long as copying it. This is because this may (depending on how lucky you've been) mean transactions across the QPI bus. Actor model programming is ideal for NUMA hardware architectures. Modern Intel and AMD architectures are, partially/fundamentally, NUMA, but the protocols they run over QPI / Hypertransport "fake" an SMP environment.
I would avoid ZMQ_PAIR sockets wherever practicable. They don't work across network connections. This means that if, for any reason, your application needs to scale across multiple computers you have to re-write your code. However, if you use different socket types from the very beginning, a scale-up of your application is nothing more than a matter of redeploying your code, not changing it. FYI nanomsg PAIRs do not have this restriction.
Don't for one moment assume that Actor model programming is going to solve all your problems. It brings in a whole suite of problems all of it's own. You can still deadlock, livelock, spinlock, etc. The problem with Actor model programmes is that these problems can be lurking in your code for years and never happen, until one day the network is just a little bit busier and -bam- your program stops running...
However, there is a development of Actor model programming called "Communicating Sequential Processes". This doesn't solve those problems, but if you've written your program with these problems they are guaranteed to happen every single time. So you discover the problem during development and testing, not five years later. There's also a process calculi for it, i.e. you can algebraically prove that your design is problem free before you ever write a single line of code. ZeroMQ is not CSP. Interestingly CSP is making something of a comeback - the Rust and Go languages both do CSP. However, they do not do CSP across network connections - it's all in-process stuff. Erlang does CSP too, and AFAIK does it across network connections.
Assuming you've read all that about CSP and are still going to use ZeroMQ, think carefully about what it is you are planning on sending across the ZeroMQ sockets. If it's all within one program on the same machine, then sending copies of, for example, arrays of integers is fine. They'll still be interpretable as integers at the receiving end. However, if you have aspirations to send data through ZMQ sockets to another computer it's well worth considering some sort of serialisation technology. ZeroMQ delivers messages. Why not make those messages the byte stream from an object serialiser? Then you can guarantee that the received message will, after de-serialisation, mean something appropriate at the receiving end, instead of having to solve problems with endianness, etc.
Favourite serialisers for me include Google Protocol Buffers. It is language / operating system agnostic, giving lots of options for a heterogeneous system. ASN.1 is another really good option, it can be got for most of the important languages, and it has a rich set of wire formats (including XML and, now/soon, JSON, which gives some interesting inter-op options), and does Constraints (something Google PBufs don't do), but does tend to cost money if one wants really good tools for it. XML can be understood by almost anything, but is bloated. Basically it's worth picking one that doesn't tie you down to using, say, C#, or Python everywhere.
Good luck!

Related

What are down sides of using ZeroMQ for sending large messages (up to gigabytes)?

I found that people don't recommend sending large messages with ZeroMQ. But it is a real headache for me to split the data (it is somewhat twisted). Why this is not recommended is there some specific reason? Can it be overcome?
Why this is not recommended?
Resources ...
Even the best Zero-Copy implementation has to have spare resources to store the payloads in several principally independent, separate locations:
|<fatMessageNo1>|
|...............|__________________________________________________________ RAM
|...............|<fatMessageNo1>|
|...............|...............|__________________Context().Queue[peerNo1] RAM
|...............|...............|<fatMessageNo1>|
|...............|...............|...............|________O/S.Buffers[L3/L2] RAM
Can it be overcome?
Sure, do not send Mastodon-sized-GB+ messages. May use any kind of an off-RAM representation thereof and send just a lightweight reference to allow a remote peer to access such an immense beast.
Many new questions added via comment:
I was concern more about something like transmission failure: what will zeromq do (will it try to retransmit automatically, will it be transparent for me etc). RAM is not so crucial - servers can have it more than enough and service that we write is not intended to have huge amount of clients at the same time. The data that I talk about is very interrelated (we have molecules/atoms info and bonds between them) so it is impossible to send a chunk of it and use it - we need it all)) – Paul 25 mins ago
You may be already aware that ZeroMQ is working under a Zen-of-Zero, where also a zero-warranty got its place.
So, a ZeroMQ dispatched message will either be delivered "through" error-free, or not delivered at all. This is a great pain-saver, as your code will receive only a fully-protected content atomically, so no tortured trash will ever reach your target post-processing. Higher level soft-protocol handshaking allows one to remain in control, enabling mitigations of non-delivered cases from higher levels of abstractions, so if your design apetite and deployment conditions permit, one can harness a brute force and send whatever-[TB]-BLOBs, at one's own risk of blocked both local and infrastructure resources, if others permit and don't mind ( ... but never on my advice :o) )
Error-recovery self-healing - from lost-connection(s) and similar real-life issues - is handled if configuration, resources and timeouts permit, so a lot of troubles with keeping L1/L2/L3-ISO-OSI layers issues are efficiently hidden from user-apps programmers.

Boost.MPI vs Boost.Asio

Good day!
What difference between these libraries?
I read MPI's docs and have small experience with asio. For me it's different
implementations of network communication and no more.
But each of them introduces different abstractions ( I'm not sure about same level
of these abstractions ) which leads to different application design.
When I should use one library or another? What I must to know for choosing right
decision in each separate situation?
Yes, Asio is good for several nodes (and very generic framework in general), but why MPI is less better for such tasks? I don't think that dependency on MPI C library is restrictive or MPI is hard to understand and what about scalability? With Asio we can implement things like broadcasting and others and from another hand MPI doesn't forbid to write simple network applications. Is it conceptually difficult to rewrite Asio-specific logic with MPI if needed?
What about socket-like communications: if it's mandatory, we can encapsulate such one in module on Asio or any other framework and still use MPI for other communications.
For me sokets and MPI standart are different network services and it's not clear what is fundamental in real world, where distance from simple client-server pair to some medium computations is one step. Also I don't think that MPI has notable overhead in comparison with Asio.
Maybe it's bad question and all we need it's something like ICE (Internet Communications Engine)? Different languages support and again (as assures ZeroC) great performance.
And, of course, I never seen in any documentation topic like 'don't use this library for it!'.
I simply can't take such disunity: in one case it's sockets, in another - asynchronous messages and finally heavy middleware platform. Where is clarity in lifecycle of development? Maybe it's not fair question, but for starting to reduce this zoo we need some point.
Each library solves different problems, they don't really overlap. It also depends what you are trying to solve, and the communication patterns of your application. Use Boost.MPI for scalability, such as scaling to thousands, or tens of thousands of nodes. Depending on the underlying network architecture, MPI also excels at collective operations: gather, scatter, broadcast, etc.
Use Boost.Asio for a socket abstraction layer if you only need a handful of nodes, such as a single server and some clients. I'd suggest using Boost.Asio if you aren't already using an MPI distribution in some fashion.
I haven't used both of them, but Boost.ASIO is more an abstraction layer for networking on a low level, whereas Boost.MPI implements the MPI standard which let's you create distributed computing systems.
So if you need some, say, socket-like communication, I'd go with ASIO. If you want to do distributed computing and maybe even interoperate with MPI programs written in other languages/for other platforms, go with Boost.MPI.

Faking a Single Address Space

I have a large scientific computing task that parallelizes very well with SMP, but at too fine grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions:
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.
My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.
Since your point 2 suggests that you can live with some performance degradation you might want to consider a hybrid approach: SMP within individual machines, message-passing between machines. I'm not familiar with D so can offer no specific advice. Further I've seen mixed reviews of the hybrid approach for OpenMP+MPI, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver a better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz, I want red warnings to look up in code, I know the application doesn't perform as well as it should, I need to know where to start looking, what should worry me and where should I put my efforts. I know it's kind of a general question but I can't post the entire program and if I could choose one section of code then I wouldn't be needing to ask in the first place.
I'm using Delphi 7, although the application will be ported / remake in .NET (c#) for the next year so I'd rather hear comments that are applicable as a general practice, and if they must be specific to either one of those languages
One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: If you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever the other CPU writes to the variable.
One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads then there are cores, typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic Socket examples where every incoming connection got it's own thread to handle of the connection. It is a very non scalable way to do things. The more threads there are, the more time the OS will have to use for context switching between threads.
You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, or a file that both use, a network both use or a processor that both use. You cannot avoid these dependencies on shared resources but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded aplication, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU for several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilitization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for bad performing threads that don't execute much or are waiting too long for something to happen, you might want to make the event they are waiting for to occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. No sense to optimize your multiprocessing if the threads themselves are performing bad.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't equal performance, always.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

MPI for multicore?

With the recent buzz on multicore programming is anyone exploring the possibilities of using MPI ?
I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download an install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses mvapich on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.
I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... I coded up some C code in graduate school to do some large scale modeling of electrophysiologic models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I thought of to tackle the problem.
1) I could use MPI alone, treating every processor as it's own "node" even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I tested two formulations first on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 for a dual-core machine as opposed to 1.3-1.4 for the threaded implementation. For the MPI code, I was able to organize operations so that processors were rarely idle, staying busy while messages were passed between them and masking much of the delay from transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that forced threads to often sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help this fact.
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.
No, in my opinion it is unsuitable for most processing you would do on a multicore system. The overhead is too high, the objects you pass around must be deeply cloned, and passing large objects graphs around to then run a very small computation is very inefficient. It is really meant for sharing data between separate processes, most often running in separate memory spaces, and most often running long computations.
A multicore processor is a shared memory machine, so there are much more efficient ways to do parallel processing, that do not involve copying objects and where most of the threads run for a very small time. For example, think of a multithreaded Quicksort. The overhead of allocating memory and copying the data to a thread before it can be partioned will be much slower with MPI and an unlimited number of processors than Quicksort running on a single processor.
As an example, in Java, I would use a BlockingQueue (a shared memory construct), to pass object references between threads, with very little overhead.
Not that it does not have its place, see for example the Google search cluster that uses message passing. But it's probably not the problem you are trying to solve.
MPI is not inefficient. You need to break the problem down into chunks and pass the chunks around and reorganize when the result is finished per chunk. No one in the right mind would pass around the whole object via MPI when only a portion of the problem is being worked on per thread. Its not the inefficiency of the interface or design pattern thats the inefficiency of the programmers knowledge of how to break up a problem.
When you use a locking mechanism the overhead on the mutex does not scale well. this is due to the fact that the underlining runqueue does not know when you are going to lock the thread next. You will perform more kernel level thrashing using mutex's than a message passing design pattern.
MPI has a very large amount of overhead, primarily to handle inter-process communication and heterogeneous systems. I've used it in cases where a small amount of data is being passed around, and where the ratio of computation to data is large.
This is not the typical usage scenario for most consumer or business tasks, and in any case, as a previous reply mentioned, on a shared memory architecture like a multicore machine, there are vastly faster ways to handle it, such as memory pointers.
If you had some sort of problem with the properties describe above, and you want to be able to spread the job around to other machines, which must be on the same highspeed network as yourself, then maybe MPI could make sense. I have a hard time imagining such a scenario though.
I personally have taken up Erlang( and i like to so far). The messages based approach seem to fit most of the problem and i think that is going to be one of the key item for multi core programming. I never knew about the overhead of MPI and thanks for pointing it out
You have to decide if you want low level threading or high level threading. If you want low level then use pThread. You have to be careful that you don't introduce race conditions and make threading performance work against you.
I have used some OSS packages for (C and C++) that are scalable and optimize the task scheduling. TBB (threading building blocks) and Cilk Plus are good and easy to code and get applications of the ground. I also believe they are flexible enough integrate other thread technologies into it at a later point if needed (OpenMP etc.)
www.threadingbuildingblocks.org
www.cilkplus.org

Resources