Which pattern should be used when implementing a multi-process program with ZeroMQ?

We are working on a multi-process program, similar to Chrome's IPC.
The communication between the parent and child processes will use ZeroMQ. The parent will send specific messages to specific children, and each child will also transmit messages to the parent at a higher frequency, say 50 times per second.
I have read the ZeroMQ documentation. As it suggests, the current plan is a ROUTER socket in the parent, running in a separate I/O thread, and a DEALER socket in each child process.
Is this a good design?

Q : "Is this a good design?"
That depends ...
Most of my distributed-computing implementations have many ZeroMQ Formal Communication Archetype patterns cooperating at once, because the system design is built on top of the ZeroMQ messaging/signalling-plane layer; no single archetype pattern suffices. That is both expected and natural -- the Zen-of-Zero is built around performance, so no archetype pattern was designed to carry a single feature more than it actually needs, and everything beyond that is achievable by application-level composition of these toolbox primitives.
The Zen-of-Zero is laser-focused on low latency and almost-linear scalability at the highest performance possible.
If there were an archetype that tried to be One-Size-Fits-All, the add-on costs of extra latency and degraded performance would be unavoidable, penalising every use-case that needs nothing beyond the very minimal design properties the toolbox primitives were built around, while any more complex needs can still be implemented on an as-needed basis.
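For reference, here is a minimal sketch of the ROUTER/DEALER plan described in the question, in Python with pyzmq (the endpoint address, identity scheme and message flow are illustrative assumptions, not a prescription):

    import multiprocessing
    import zmq

    ADDR = "ipc:///tmp/parent_child"  # assumed endpoint; tcp://127.0.0.1:5555 also works

    def child(child_id: str) -> None:
        ctx = zmq.Context()
        dealer = ctx.socket(zmq.DEALER)
        dealer.setsockopt_string(zmq.IDENTITY, child_id)  # lets the ROUTER address this child
        dealer.connect(ADDR)
        dealer.send(b"hello from " + child_id.encode())   # child -> parent
        print(child_id, "got:", dealer.recv())            # parent -> this specific child
        dealer.close()
        ctx.term()

    if __name__ == "__main__":
        ctx = zmq.Context()
        router = ctx.socket(zmq.ROUTER)
        router.bind(ADDR)

        procs = [multiprocessing.Process(target=child, args=(f"child-{i}",))
                 for i in range(3)]
        for p in procs:
            p.start()

        # ROUTER prepends the sender's identity frame; reuse it to reply to that child
        for _ in procs:
            identity, payload = router.recv_multipart()
            router.send_multipart([identity, b"ack: " + payload])

        for p in procs:
            p.join()
        router.close()
        ctx.term()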

Related

Splitting work between multiple pollers?

My current setup on the server side is like this -- I have a manager (with a poller) which waits for incoming requests for work to do. Once something is received, it creates a worker (with a separate poller and separate ports/sockets) for the job, and from then on the worker communicates directly with the client.
What I observe is that when there is intense traffic with any of the workers, it stalls the manager somewhat -- ReceiveReady events fire with significant delays.
The NetMQ documentation states: "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am far below this limit (say 100 messages in a row), but I wonder whether having multiple pollers in a single program clips performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is -- is using multiple pollers in a single program sound, performance-wise? Or, in reverse -- do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design the system based on facts and requirements, rather than on popularised opinions.
Implement performance benchmarks and measure details of the actual implementation; comparing facts against thresholds is what makes a Fact-Based Decision.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop has to handle several classes of ZeroMQ-integrated signalling/messaging inputs/outputs, all in a principally non-blocking mode, and your design has to spend a specific amount of relative attention on each such class.
One may accept some higher inter-process latencies for a remote keyboard (running a CLI interface "across" a network), while the event loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE stream. So one creates a lightweight real-time-scheduler logic, introducing one high-priority Poller() that is non-blocking (zero-wait), another with a ~5 ms timeout for reading the "slow" channels, and another with a 15 ms timeout for reading the main data-flow pipe. If you have profiled your event-handling routines to last no more than 5 ms worst case, you can still meet a TAT of 25 ms, and your event loop can serve systems that require a stable control-loop cycle of 40 Hz.
Without a set of several "specialised" pollers, one cannot get this level of scheduling determinism with an easily expressed core logic integrated into such principally stable control loops. A sketch of this scheme follows below.
Q.E.D.
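A minimal sketch of such a multi-poller priority loop in Python with pyzmq (the endpoints, handlers and PULL-socket choice are illustrative assumptions; the producing side is assumed to have bound the inproc endpoints elsewhere in the same process):

    import zmq

    def handle_quote(msg): print("quote:", msg)   # hypothetical handlers,
    def handle_cli(msg):   print("cli:", msg)     # each profiled to stay
    def handle_data(msg):  print("data:", msg)    # under 5 ms worst case

    ctx = zmq.Context()
    hot  = ctx.socket(zmq.PULL); hot.connect("inproc://quotes")     # must never be missed
    slow = ctx.socket(zmq.PULL); slow.connect("inproc://cli")       # remote-keyboard channel
    bulk = ctx.socket(zmq.PULL); bulk.connect("inproc://dataflow")  # main data-flow pipe

    p_hot, p_slow, p_bulk = zmq.Poller(), zmq.Poller(), zmq.Poller()
    p_hot.register(hot, zmq.POLLIN)
    p_slow.register(slow, zmq.POLLIN)
    p_bulk.register(bulk, zmq.POLLIN)

    while True:                        # one cycle: 0 + 5 + 15 ms of waiting + handling
        if p_hot.poll(0):              # high priority: zero-wait test
            handle_quote(hot.recv())
        if p_slow.poll(5):             # "slow" channels: ~5 ms test
            handle_cli(slow.recv())
        if p_bulk.poll(15):            # main pipe: 15 ms test
            handle_data(bulk.recv())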
I use a similar design to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML predictors, where transaction times are kept under ~70 ms (end-to-end TAT, from a QUOTE arrival to an AI/ML-advised XTO order instruction being submitted), precisely due to the need to match the real-time constraints of the control-loop scheduling requirements.
Epilogue:
If the documentation says something about poller performance in the range above 1 kHz of signal delivery, but says nothing about the duration of the signal/message handling process itself, it does the public a poor service.
The first step is to measure the process latencies; next, analyse the performance envelopes. All ZeroMQ tools are designed to scale, and so should the application infrastructure -- so forget about any SLOC-sized examples. The bottleneck is not the poller instance, but a poor application-level use of the available ZeroMQ components (given that the known performance envelope was taken into account). One can always increase the overall processing capacity available; with ZeroMQ we are in the distributed-systems realm from Day 0, aren't we?
So in carefully designed + monitored + adaptively scaled systems, no choking will appear.

Using ZMQ for bidirectional inter-thread communication

I am new to ZeroMQ. I have spent the last couple of months reading the documentation and experimenting with the library. I am currently developing a multi-threaded C++ application and want to use ZeroMQ instead of mutexes to exchange data between my main thread and one of its children.
The child thread handles the communication with an external application. Therefore, I will need two queues/sockets between the main thread and its child: one for outgoing messages and one for incoming messages.
Which ZMQ socket type should I use in order to achieve this?
Thanks in advance.
By moving from using shared memory and mutexes to using ZeroMQ, you are entering the realm of Actor model programming.
This, in my opinion, is a fairly good thing. However, there are some things to be aware of.
The only reason mutexes are no longer needed is because you are copying data, not sharing it. The 'cost' is that copying a lot of data takes a lot longer than locking a mutex that points to shared data. So you can end up with a nice looking Actor model program that runs like a dog in comparison to an equivalent program that uses shared memory / mutexes.
A caveat is that on complicated architectures like Intel Xeons with multiple CPUs, accessing shared memory can, conceivably, take just as long as copying it. This is because it may (depending on how lucky you've been) involve transactions across the QPI bus. Actor model programming is ideal for NUMA hardware architectures. Modern Intel and AMD architectures are, partially/fundamentally, NUMA, but the protocols they run over QPI / Hypertransport "fake" an SMP environment.
I would avoid ZMQ_PAIR sockets wherever practicable. They don't work across network connections. This means that if, for any reason, your application needs to scale across multiple computers you have to re-write your code. However, if you use different socket types from the very beginning, a scale-up of your application is nothing more than a matter of redeploying your code, not changing it. FYI nanomsg PAIRs do not have this restriction.
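For the bidirectional main-thread/child-thread case in the question, a common PAIR-free arrangement is two one-way PUSH/PULL pipes over the inproc:// transport. A minimal sketch in Python with pyzmq (the endpoint names are illustrative assumptions; swapping inproc:// for tcp:// is what makes the later scale-out possible):

    import threading
    import zmq

    ctx = zmq.Context()  # one shared context; inproc requires the same context

    def child(ctx: zmq.Context) -> None:
        # Mirror sockets: PULL for commands in, PUSH for results out
        inbox = ctx.socket(zmq.PULL);  inbox.connect("inproc://to_child")
        outbox = ctx.socket(zmq.PUSH); outbox.connect("inproc://to_main")
        cmd = inbox.recv()
        outbox.send(b"done: " + cmd)  # talk back to the main thread
        inbox.close(); outbox.close()

    # Main thread: bind both one-way pipes before the child connects
    to_child = ctx.socket(zmq.PUSH);   to_child.bind("inproc://to_child")
    from_child = ctx.socket(zmq.PULL); from_child.bind("inproc://to_main")

    t = threading.Thread(target=child, args=(ctx,))
    t.start()
    to_child.send(b"work item #1")   # outgoing queue
    print(from_child.recv())         # incoming queue -> b"done: work item #1"
    t.join()
    to_child.close(); from_child.close(); ctx.term()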
Don't for one moment assume that Actor model programming is going to solve all your problems. It brings in a whole suite of problems all of its own. You can still deadlock, livelock, spinlock, etc. The problem with Actor model programmes is that these problems can be lurking in your code for years and never happen, until one day the network is just a little bit busier and -bam- your program stops running...
However, there is a development of Actor model programming called "Communicating Sequential Processes". This doesn't remove those problems, but if your program has them, they are guaranteed to happen every single time, so you discover the problem during development and testing, not five years later. There's also a process calculus for it, i.e. you can algebraically prove that your design is problem-free before you ever write a single line of code. ZeroMQ is not CSP. Interestingly, CSP is making something of a comeback -- the Go language has CSP-style channels, and Rust has something similar. However, they do not do CSP across network connections -- it's all in-process stuff. Erlang's message passing is close in spirit (strictly speaking it is Actor model rather than CSP) and, AFAIK, works across network connections.
Assuming you've read all that about CSP and are still going to use ZeroMQ, think carefully about what it is you are planning on sending across the ZeroMQ sockets. If it's all within one program on the same machine, then sending copies of, for example, arrays of integers is fine. They'll still be interpretable as integers at the receiving end. However, if you have aspirations to send data through ZMQ sockets to another computer it's well worth considering some sort of serialisation technology. ZeroMQ delivers messages. Why not make those messages the byte stream from an object serialiser? Then you can guarantee that the received message will, after de-serialisation, mean something appropriate at the receiving end, instead of having to solve problems with endianness, etc.
Favourite serialisers for me include Google Protocol Buffers. It is language- and operating-system agnostic, giving lots of options for a heterogeneous system. ASN.1 is another really good option; it is available for most of the important languages, it has a rich set of wire formats (including XML and, now/soon, JSON, which gives some interesting inter-op options), and it does Constraints (something Google PBufs don't do), but it does tend to cost money if one wants really good tools for it. XML can be understood by almost anything, but is bloated. Basically it's worth picking one that doesn't tie you down to using, say, C# or Python everywhere.
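As a sketch of the serialise-then-send idea (Python's standard-library JSON stands in here for a real serialiser such as Protocol Buffers; the message shape and endpoint are assumptions):

    import json
    import zmq

    ctx = zmq.Context()
    tx = ctx.socket(zmq.PUSH); tx.bind("inproc://wire")
    rx = ctx.socket(zmq.PULL); rx.connect("inproc://wire")

    # Serialise a structured message to bytes before handing it to ZeroMQ...
    order = {"symbol": "EURUSD", "qty": 100, "side": "buy"}
    tx.send(json.dumps(order).encode("utf-8"))

    # ...and de-serialise on the receiving end; endianness never enters the picture
    received = json.loads(rx.recv().decode("utf-8"))
    assert received == order
    tx.close(); rx.close(); ctx.term()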
Good luck!

OpenGL - Client-server architecture is relatively slower, isn't it?

Citing:
In the client-server model, the commands issued by the client do not necessarily get sent to the server immediately. If the client and server are over a network, it will be very inefficient to send individual commands over the network. Instead, the commands can be buffered on the client side and then issued to the server at a later point in time. As a result, there needs to be a mechanism that lets the client know when the server has completed execution of previously submitted commands.
-- OpenGL ES 2.0 Programming Guide, Aaftab Munshi, Dan Ginsburg, Dave Shreiner
On the other hand, OpenGL's error-reporting mechanism (glGetError) and other facilities are designed so that only minimal checks are performed, in order to achieve maximal performance.
Isn't a client-server architecture relatively slower than a non-client-server (standalone) one? Was the ability to work across different machines really worth adopting a client-server architecture?
While OpenGL has been designed to enable a client-server implementation, a particular implementation is free to implement things in a direct-rendering way, i.e. OpenGL calls are executed immediately or with little delay.
However, it must be clearly stated that a client-server architecture does not imply a performance penalty. In fact, the asynchronous execution model of OpenGL gives implementations so much leeway in how operations are carried out that, by internally rearranging the order of operations, better throughput can be achieved than with straight in-order execution. The only requirement is that the results of the operations must match those of in-order execution. This is no different from out-of-order execution on modern CPUs.
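The "mechanism that lets the client know when the server has completed execution" from the quoted passage shows up as explicit synchronisation calls. A minimal sketch with PyOpenGL (assumes a current GL context already exists, e.g. created via GLFW; shown only to illustrate the flush/finish semantics):

    from OpenGL.GL import (GL_COLOR_BUFFER_BIT, glClear, glClearColor,
                           glFinish, glFlush)

    # ... a current GL context is assumed to exist at this point ...

    glClearColor(0.0, 0.0, 0.0, 1.0)  # commands may merely be buffered client-side
    glClear(GL_COLOR_BUFFER_BIT)

    glFlush()   # push buffered commands to the server, but do not wait for them
    glFinish()  # block until the server reports all prior commands as complete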
On the other hand, OpenGL error reporting mechanism (glGetError) and other things are implemented so that minimal checks are implemented in order to achieve maximal performance.
Well, actually OpenGL requires a lot of internal error checking and validation (this is one of the points addressed by Vulkan, which allows this error checking and validation to be disabled or removed in release builds). So this statement of yours is not accurate.

What is instrumentation point?

I came across the concept in a paper on dynamic instrumentation, but I couldn't find an explanation of it. Please explain, if possible...
EDIT: or is there any tutorial on how to achieve lightweight dynamic instrumentation (in user space, for syscalls and normal function calls)?
EDIT(Added paper details):
A code generation approach to optimizing high-performance distributed data stream processing
Abstract:
We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.
Instrumentation means inserting code into a stream of instructions whose purpose is to measure something -- execution time, function calls, data access, all sorts of things relating to profiling. That's one of two ways to do profiling, and it's the more accurate but slower one. The other one is sampling, where you periodically interrupt the program and look at its current state. This has less performance impact but isn't as accurate, especially for short runs.
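As a minimal illustration of inserted measurement code, here is a hypothetical timing decorator in Python; the wrapped call site plays the role of an instrumentation point (real instrumenting profilers inject comparable probes at compile time or run time):

    import functools
    import time

    def instrument(fn):
        """Insert measurement code around fn -- a software instrumentation point."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"{fn.__name__} took {elapsed * 1e6:.1f} us")
        return wrapper

    @instrument
    def work(n):
        return sum(i * i for i in range(n))

    work(100_000)  # prints e.g. "work took 5234.7 us"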
Without knowing what paper you are referencing it is difficult to be sure, but in general it would be a place in the code that has a "hook" for instrumentation.
That is, it is coded so that it can be dynamically instrumented, allowing measurements to be recorded about how the code runs.
Whether this would be for time spent in a method, power consumption or something else depends on what and how it is being instrumented.
It would be useful to see a link to the paper for the context.
In a tool such as SystemTap or GDB, an instrumentation point is any place in the code whose execution can yield an event. For "dynamic" instrumentation, there is usually no need to compile a hook into the code; the tool just needs to determine a PC address where a breakpoint can be inserted, as sketched below.
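A minimal sketch of that breakpoint-based approach using GDB's embedded Python API (the "main" location is an assumed symbol in the debugged program; run the script from inside gdb):

    import gdb  # only available inside gdb's embedded Python (e.g. gdb -x this_file.py ./a.out)

    class CallCounter(gdb.Breakpoint):
        """A dynamic instrumentation point: a breakpoint that records an event and resumes."""
        def __init__(self, location):
            super().__init__(location)
            self.hits = 0

        def stop(self):
            self.hits += 1   # record the event at this PC address
            return False     # never actually stop the inferior

    counter = CallCounter("main")  # assumed function name in the target binary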

Boost.MPI vs Boost.Asio

Good day!
What is the difference between these libraries?
I have read MPI's docs and have a little experience with Asio. To me they are different implementations of network communication and nothing more.
But each of them introduces different abstractions (I am not sure whether these abstractions are on the same level), which lead to different application designs.
When should I use one library or the other? What must I know in order to make the right choice in each separate situation?
Yes, Asio is good for several nodes (and is a very generic framework in general), but why is MPI less suitable for such tasks? I don't think that the dependency on an MPI C library is restrictive, or that MPI is hard to understand, and what about scalability? With Asio we can implement things like broadcasting and more; on the other hand, MPI doesn't forbid writing simple network applications. Is it conceptually difficult to rewrite Asio-specific logic with MPI if needed?
What about socket-like communication: if it's mandatory, we can encapsulate it in a module built on Asio or any other framework, and still use MPI for the other communications.
For me, sockets and the MPI standard are different network services, and it's not clear which is fundamental in the real world, where the distance from a simple client-server pair to some medium-sized computation is one step. Also, I don't think that MPI has notable overhead in comparison with Asio.
Maybe it's a bad question and all we need is something like ICE (Internet Communications Engine)? It supports many languages and, again (as ZeroC assures), offers great performance.
And, of course, I have never seen in any documentation a topic like 'don't use this library for that!'.
I simply can't accept such disunity: in one case it's sockets, in another asynchronous messages, and finally a heavyweight middleware platform. Where is the clarity in the development lifecycle? Maybe it's not a fair question, but to start reducing this zoo we need some reference point.
Each library solves different problems; they don't really overlap. Which to use also depends on what you are trying to solve and on the communication patterns of your application. Use Boost.MPI for scalability, such as scaling to thousands or tens of thousands of nodes. Depending on the underlying network architecture, MPI also excels at collective operations: gather, scatter, broadcast, etc.
Use Boost.Asio for a socket abstraction layer if you only need a handful of nodes, such as a single server and some clients. I'd suggest using Boost.Asio if you aren't already using an MPI distribution in some fashion.
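A minimal sketch of the collective-operation style that MPI excels at, using mpi4py as a compact stand-in for the same concepts Boost.MPI wraps for C++ (the script name and payload are illustrative):

    from mpi4py import MPI  # run with: mpiexec -n 4 python bcast_demo.py

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Rank 0 owns the data; broadcast hands every rank an identical copy
    data = {"job": "demo", "chunk_size": 1024} if rank == 0 else None
    data = comm.bcast(data, root=0)

    print(f"rank {rank} received: {data}")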
I haven't used both of them, but Boost.Asio is more of a low-level networking abstraction layer, whereas Boost.MPI implements the MPI standard, which lets you create distributed computing systems.
So if you need some, say, socket-like communication, I'd go with Asio. If you want to do distributed computing, and maybe even interoperate with MPI programs written in other languages/for other platforms, go with Boost.MPI.
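For contrast, here is the socket-style, handful-of-nodes use-case that Asio serves, sketched with Python's asyncio as a conceptual stand-in for Boost.Asio's asynchronous socket model (address and port are arbitrary):

    import asyncio

    async def handle(reader, writer):
        data = await reader.read(100)     # socket-style request
        writer.write(b"echo: " + data)    # socket-style reply
        await writer.drain()
        writer.close()

    async def main():
        # One server and one client in the same process stand in for "a handful of nodes"
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            reader, writer = await asyncio.open_connection("127.0.0.1", 8888)
            writer.write(b"hello")
            await writer.drain()
            print(await reader.read(100))  # b'echo: hello'
            writer.close()

    asyncio.run(main())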
