Is there a standard for roundtrip message time in the majordomo pattern using ZeroMQ and how is that measured or found? - zeromq

I am not sure if my ZeroMQ Majordomo implementation is correct. I believe the guide suggests that this pattern can handle tens of thousands of messages per second using a synchronous-roundtrip pattern. My present solution seems to struggle to send more than one thousand messages per second. My goal is to get as close as I can to running at least ten thousand messages per second.
I am running all of the components of the Majordomo pattern on separate Windows 2012 servers, each with 12 processors and 32 GB of RAM, so I am sure there cannot be resource constraints. All of these servers are running within the same network as well, meaning they are not traversing a firewall. My code runs slower due to the business logic incorporated into it, so for my speed testing I went back to the simple test code that is provided in the ZeroMQ guide. These are supposedly the clients that are used in the guide to show the messages per second. I have tried a few different languages as well, including Delphi and C#, neither of which seems to be able to reach the promised speeds.
Code can be found for those here:
http://zguide.zeromq.org/page:all
I am wondering if I am expecting too much from the pattern. Roundtrip message times in the pattern seem to sit around 25 ms when sending from 100 clients to a single broker, and then to 100 workers and back through the pattern. This seems slow, and sending 10000 messages from these clients takes about 4 seconds, which isn't anywhere near the promise of tens of thousands per second. Am I expecting too much of this pattern, or is there something I'm missing here?
By the way, I've seen posts about the HWM (high water mark) being hit, but given that we're in a synchronous pattern I don't believe that could be an issue, since we are only able to queue a maximum number of messages equal to our client count.
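As a rough sanity check on the figures in the question, here is a minimal sketch that assumes each synchronous client keeps at most one request in flight, so the aggregate rate cannot exceed the client count divided by the roundtrip time. The function names are illustrative only and are not part of ZeroMQ or the guide.

```python
def sync_throughput(clients: int, rtt_ms: float) -> float:
    """Upper bound on messages/s when every client waits for its reply."""
    return clients * 1000.0 / rtt_ms


def required_rtt_ms(clients: int, target_msgs_per_s: float) -> float:
    """Largest roundtrip time (ms) that still allows the target rate."""
    return clients * 1000.0 / target_msgs_per_s


# Figures taken from the question: 100 clients, ~25 ms roundtrip,
# 10000 messages in about 4 seconds, target of 10000 messages/s.
bound = sync_throughput(clients=100, rtt_ms=25)
observed = 10_000 / 4
needed = required_rtt_ms(clients=100, target_msgs_per_s=10_000)
print(bound, observed, needed)
```

Under this assumption, 100 synchronous clients at 25 ms cannot exceed 4000 messages per second, and reaching 10000 per second would need either a roundtrip of about 10 ms or more concurrent clients.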

Q : ...struggles to send more than one thousand messages per second ... goal is to get as close as I can to running at least ten thousand messages per second.
The ZeroMQ ecosystem has evolved beyond the state it was in when the performance tests were published.
The best first step :
is to explicitly publish all details of the System-under-Test (SuT): for each of the hosts, its hardware details (best as an hwloc-like "fingerprint"); O/S configuration details (priorities, background workloads, the costs of a virtualised machine, if applicable, i.e. how many vCPU clocks were actually stolen from "inside" the VM by the external hypervisor); interfaces (O/S buffer settings, free I/F capacity under background workload); ToS / VLAN or other details that modify LAN-interconnect performance; the theoretical performance ceilings of the LAN-interconnect L2-switching silicon; and the background workloads present before, during and after the SuT episodes.
Next comes the code :
a full copy of all the code, a.k.a. the MCVE (inspectable and repeatable) code. Since no code has been posted here so far, no one can judge the root cause of the observed ~ 1k [msgs/s].
Next comes a definition of "reference point" to be used in test :
at which defined "reference" point(s) the SuT ought to be measured, and what the expected ranges are for the turn-around time (TaT) in [ms].
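One way to make such a reference point concrete is to collect TaT samples there and compare them with the expected range. The sketch below is only an illustration: the function name, the nearest-rank percentile and the 40 ms ceiling are assumptions, and the sample values are made up.

```python
import statistics


def summarise_tat(samples_ms, expected_max_ms):
    """Summarise turn-around-time samples taken at one reference point."""
    if not samples_ms:
        raise ValueError("no samples collected")
    ordered = sorted(samples_ms)
    # Nearest-rank 99th percentile, computed as ceil(0.99 * n) with integers.
    rank = max(1, -(-99 * len(ordered) // 100))
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
        "within_expected": ordered[-1] <= expected_max_ms,
    }


# Made-up samples, for illustration only.
report = summarise_tat([21.0, 25.0, 24.0, 30.0, 26.0], expected_max_ms=40.0)
print(report)
```

Publishing a summary like this per reference point, together with the SuT details, gives others something concrete to compare against.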
I have worked with ZeroMQ DLLs for rather a long time (since 2.11) and I believe one can just as easily overload an AccessNode with ingress traffic as suffocate the ZeroMQ infrastructure through improper configuration. Service performance will decrease, if not outright degrade, unless due measures are taken to configure all the resources that sit on the service's critical path.
Without hard facts (as listed in the previous few steps) there remain only guesses or (as yet) unsupported opinions.

Related

What are down sides of using ZeroMQ for sending large messages (up to gigabytes)?

I found that people don't recommend sending large messages with ZeroMQ. But it is a real headache for me to split the data (it is somewhat twisted). Why is this not recommended? Is there some specific reason? Can it be overcome?
Why is this not recommended?
Resources ...
Even the best Zero-Copy implementation has to have spare resources to store the payloads in several principally independent, separate locations:
|<fatMessageNo1>|
|...............|__________________________________________________________ RAM
|...............|<fatMessageNo1>|
|...............|...............|__________________Context().Queue[peerNo1] RAM
|...............|...............|<fatMessageNo1>|
|...............|...............|...............|________O/S.Buffers[L3/L2] RAM
Can it be overcome?
Sure: do not send mastodon-sized GB+ messages. You may use any kind of off-RAM representation of the data and send just a lightweight reference that allows the remote peer to access such an immense beast.
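The "lightweight reference" idea can be sketched as follows. This assumes both peers can reach the same storage (here simply a local temporary file); the JSON layout and the function names are hypothetical, and the actual socket send/receive of the small message is left out.

```python
import hashlib
import json
import os
import tempfile


def file_sha256(path):
    """Hash a file in 1 MiB chunks so it never has to fit in RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_reference(path):
    """Build the small message that is sent instead of the payload itself."""
    return json.dumps({
        "path": path,
        "size": os.path.getsize(path),
        "sha256": file_sha256(path),
    }).encode("utf-8")


def verify_reference(message):
    """Receiver side: check that the referenced payload is complete and intact."""
    ref = json.loads(message.decode("utf-8"))
    return (os.path.getsize(ref["path"]) == ref["size"]
            and file_sha256(ref["path"]) == ref["sha256"])


# Demo with a 3 MB stand-in for the "fat" payload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3_000_000)
reference = make_reference(tmp.name)
ok = verify_reference(reference)
os.remove(tmp.name)
print(len(reference), "bytes sent instead of 3000000; intact:", ok)
```

Only the reference travels through the messaging layer, so the queue and O/S buffers never hold copies of the large payload.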
Many new questions added via comment:
I was more concerned about something like transmission failure: what will ZeroMQ do (will it try to retransmit automatically, will it be transparent for me, etc.)? RAM is not so crucial: servers can have more than enough of it, and the service that we are writing is not intended to have a huge number of clients at the same time. The data that I am talking about is very interrelated (we have molecule/atom info and the bonds between them), so it is impossible to send a chunk of it and use it; we need it all. – Paul 25 mins ago
You may already be aware that ZeroMQ works under a Zen-of-Zero, in which zero-warranty also has its place.
So a message dispatched by ZeroMQ will either be delivered error-free or not delivered at all. This is a great pain-saver: your code receives only complete content, atomically, so no mangled trash will ever reach your post-processing. Higher-level soft-protocol handshaking lets you remain in control and mitigate non-delivered cases from higher levels of abstraction. So, if your design appetite and deployment conditions permit, you can use brute force and send whatever-[TB] BLOBs, at your own risk of blocking both local and infrastructure resources, if others permit and don't mind ( ... but never on my advice :o) ).
Error recovery and self-healing from lost connections and similar real-life issues are handled if configuration, resources and timeouts permit, so a lot of trouble with L1/L2/L3 ISO-OSI layer issues is efficiently hidden from user-app programmers.

Splitting work between multiple pollers?

My current setup for work on the server side is like this: I have a manager (with a poller) which waits for incoming requests for work to do. Once something is received, it creates a worker (with a separate poller, and separate ports/sockets) for the job, and from then on the worker communicates directly with the client.
What I observe is that when there is intense traffic with any of the workers, it somewhat stalls the manager: ReceiveReady events are fired with significant delays.
NetMQ documentation states "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am far below this limit (say 100 messages in a row), but I wonder whether having multiple pollers in a single program clips performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is: is using multiple pollers in a single program sensible performance-wise? Or, conversely, do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design the system based on facts and requirements, rather than listening to popularised opinions.
Implement performance benchmarks and measure the details of the actual implementation. Comparing facts against thresholds is what is known as a Fact-Based Decision.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop has to handle several classes of ZeroMQ-integrated signalling / messaging inputs and outputs, all in a principally non-blocking mode, and your design has to spend a specific amount of relative attention on each such class.
One may accept somewhat higher inter-process latencies for a remote keyboard (running a CLI interface "across" a network), while the event loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE-stream. So one has to create a light-weight real-time SCHEDULER logic that introduces one high-priority Poller() for non-blocking (zero-wait) reads, another one with a ~ 5 ms test on reading the "slow" channels, and another one with a 15 ms test on reading the main data-flow pipe. If you have profiled your event-handling routines to last no more than 5 ms in the worst case, you can still handle a TAT of 25 ms, and your event loop may serve systems that require a stable control-loop cycle of 40 Hz.
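The timing budget in this scenario can be checked with a few lines of arithmetic. This is one reading of the figures above, assuming the worst case is that every poll runs to its timeout and then one handler runs; the channel names are placeholders.

```python
# Poll timeouts in ms for the three channel classes described above
# (the dictionary keys are illustrative names only).
POLL_TIMEOUTS_MS = {"priority": 0, "slow": 5, "main": 15}
HANDLER_WORST_CASE_MS = 5


def worst_case_cycle_ms(timeouts_ms, handler_ms):
    """Longest loop iteration: every poll times out, then one handler runs."""
    return sum(timeouts_ms.values()) + handler_ms


def loop_rate_hz(cycle_ms):
    """Control-loop frequency that a given worst-case cycle can sustain."""
    return 1000.0 / cycle_ms


cycle = worst_case_cycle_ms(POLL_TIMEOUTS_MS, HANDLER_WORST_CASE_MS)
rate = loop_rate_hz(cycle)
print(cycle, "ms per cycle ->", rate, "Hz")
```

With these numbers the worst-case cycle is 25 ms, which corresponds to the 40 Hz control loop mentioned in the text.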
Without a set of several "specialised" pollers, one cannot get this level of scheduling determinism while keeping the core logic easy to express and integrate into such principally stable control loops.
Q.E.D.
I use a similar design to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML predictors, where transaction times are kept under ~ 70 ms (end-to-end TAT, from a QUOTE arrival to an AI/ML-advised XTO order instruction being submitted), precisely because of the need to meet the real-time constraints of the control-loop scheduling.
Epilogue:
If the documentation says something about poller performance in the ranges above 1 kHz of signal delivery, but does not mention anything about the duration of the signal/message handling process, it does the public a poor service.
The first step to take is to measure the process latencies; the next is to analyse the performance envelopes. All ZeroMQ tools are designed to scale, and so should the application infrastructure. So forget about any SLOC-sized examples: the bottleneck is not the poller instance, but a poor application-level use of the available ZeroMQ components (given that a known performance envelope was taken into account). One can always increase the overall processing capacity available; with ZeroMQ we are in the distributed-systems realm from Day 0, aren't we?
So in concisely designed, monitored and adaptively scaled systems, no choking will appear.

Can I prevent ZeroMQ from occupying file descriptors?

I have a setup where a master process is setting up a ZMQ_ROUTER and then forks many child processes, which then connect to that router.
Whenever a child zmq_connect()'s to the master, one file descriptor is occupied.
This however limits the number of interacting processes to the number of allowed file descriptors ( per process ). For me ( linux ), this currently is just 1024.
That is way too small for my intended use ( a multi-agent / swarm simulation ).
Answer:
You can't, except when using an inter-thread socket type ( using an inproc:// transport-class ). All other protocols use one file descriptor per connection.
One new approach to reducing the number of file descriptors an application needs, if that application exposes several services (e.g. several tcp://<address:port> endpoints that can be connected to), seems to be to use the resource property, which allows one to combine several services into one endpoint.
The Swarm first:
First of all, a smart solution for a massive herd of agents requires both flexibility (in the swarm framework design, for adding features) and efficiency (for both scalability and speed), so as to achieve the fastest possible simulation run-times in spite of PTIME & PSPACE obstacles, with a possible risk of wandering into the EXPTIME zone in more complex inter-agent communication schemes.
Efficiency next:
At first, my guess was to use and slightly customise a lighter-weight POSIX-based signalling/messaging framework, nanomsg -- a younger sister of ZeroMQ from Martin SUSTRIK, co-father of ZeroMQ -- where a Context()-less design plus additional features such as the SURVEY and BUS messaging archetypes are particularly attractive for swarms with your own software-designed, problem-domain-specific messaging/signalling protocols:
100k+ with file_descriptors
Well, you need courage. It is doable, but roll up your sleeves: it will require hands-on effort in the kernel and tuning of system settings, and you will pay for such scale with increased overheads.
Andrew Hacking has explained both the PROS and CONS of "just" increasing the fd count (and not only on the kernel side of system tuning and configuration).
Other factors to consider are that while some software may use sysconf(OPEN_MAX) to dynamically determine the number of files that may be open by a process, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors and as such can never have more than that many files open regardless of any administratively defined higher limit.
Andrew has also directed your kind attention to this, which may serve as a definitive report on how to set up a system for 100k-200k connections.
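Before tuning anything, it helps to see what limits a process actually has. The sketch below uses Python's Unix-only `resource` module and `os.sysconf`; the `connection_budget` helper and its `reserved` default of 16 descriptors are assumptions for illustration, not measured values.

```python
import os
import resource  # Unix-only standard-library module


def connection_budget(soft_limit, reserved=16):
    """Descriptors left for peer connections after setting some aside for
    stdio, log files and library internals (`reserved` is a rough guess)."""
    return max(soft_limit - reserved, 0)


# Current per-process limits: the soft limit is what the process runs with,
# the hard limit is the ceiling it may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft:", soft, "hard:", hard)
print("sysconf SC_OPEN_MAX:", os.sysconf("SC_OPEN_MAX"))
print("connections available with a 1024 limit:", connection_budget(1024))
```

A process may raise its own soft limit up to the hard limit with `resource.setrlimit`, but, as the quoted text notes, code built around `FD_SETSIZE` will still be capped regardless of that setting.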
Do static scales above 100k per host make any real sense for swarm simulations?
While still "technically" doable, there are further limits -- even nanomsg will not be able to push more than about 1,000,000 [MSGs/s], which is more than enough for most applications, as they cannot keep pace with this native speed of message dispatch. Citations state some ~6 [us] for CPU-core to CPU-core transfer latencies, and if the user-designed swarm-processing application cannot run its sending loop in under some 3-4 [us], this performance ceiling is nowhere near becoming an issue.
How to scale above that?
Distributed multi-host processing is the first dimension in which to attack the static scale of the swarm. Next would come a need to introduce RDMA injection, so as to escape the performance bottleneck of any stack processing in the implementation of the distributed messaging / signalling. Yes, this can move your swarm system into the zone of nanosecond-scale latencies, but at the cost of building an HPC / high-tech computing infrastructure (which would be a great Project, if your Project sponsor can finance such an undertaking -- and please do let me know if so, I would be more than keen to join such a swarm-intelligence HPC lab). It is worth knowing about this implication before deciding on the architecture, and knowing the ultimate limits is the key to doing it well from the very beginning.

Measuring ZeroMQ performances on a network

This is probably a very naïve question, but I'm really a newbie in that stuff.
I'd like to test 0MQ performances (latency, throughput) according to different communication patterns: REQ/REP, PUB/SUB, PUSH/PULL, ROUTER/DEALER and so on, ... and estimate how well, performance-wise, 0MQ would handle the various communication scenarios we encounter in our software.
When everything runs on the same machine, it is relatively easy to measure things and do basic statistics according to message size, etc. I know for sure when my messages are sent, and when they are received.
But how can I do measurements across the network without a common time reference (which is accurate enough, I mean)? Do I measure round-trips (from machine A to machine B and back)? Is that a meaningful test?
ZeroMQ comes with performance testing tools; look in the perf/ directory. E.g. to test throughput, run local_thr on one machine and remote_thr on the other. You can set message sizes and counts. Do test with enough messages to get an accurate figure (the test should run for at least 5-10 seconds).
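On the question of a missing common clock: a roundtrip measurement needs only the sender's clock, because both timestamps are taken on the same machine. The sketch below illustrates this with a plain standard-library socket pair and an echo thread standing in for the remote host; it is not the ZeroMQ perf tooling, and the function names are made up for the example.

```python
import socket
import statistics
import threading
import time


def echo_peer(conn, count):
    """Stand-in for the remote machine: send every message straight back."""
    for _ in range(count):
        conn.sendall(conn.recv(64))


def measure_roundtrips(count=200, payload=b"ping"):
    """Return roundtrip times in seconds, timed entirely on the sender's clock."""
    near, far = socket.socketpair()
    worker = threading.Thread(target=echo_peer, args=(far, count))
    worker.start()
    samples = []
    for _ in range(count):
        start = time.perf_counter()
        near.sendall(payload)
        near.recv(64)  # block until the echo arrives
        samples.append(time.perf_counter() - start)
    worker.join()
    near.close()
    far.close()
    return samples


samples = measure_roundtrips()
print("median roundtrip [us]:", statistics.median(samples) * 1e6)
```

Halving the roundtrip gives a one-way estimate only if the path is symmetric, which is an assumption that should be stated alongside the results.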

How to do performance and scalability testing without clear requirements?

Any idea how to do performance and scalability testing if no clear performance requirements have been defined?
More information about my application.
The application has 3 components. One component can only run on Linux; the other two components are Java programs, so they can run on Linux/Windows/Mac... The 3 components can all be deployed to one box, or each component can be deployed to its own box. Deployment is very flexible. The Linux-only component captures raw TCP/IP packets over the network; one Java component then gets that raw data from it, assembles it into the data end users need, and writes it to hard disk as data files. The last Java component uploads data from the data files to my database in batches.
In the absence of 'must be able to perform X iterations within Y seconds...' type requirements, how about these kinds of things:
Does it take twice as long for twice the size of dataset? (yes = good)
Does it take 10x as long for twice the size of dataset? (yes = bad)
Is it CPU bound?
Is it RAM bound (eg lots of swapping to virtual memory)?
Is it IO / Disk bound?
Is there a certain data-set size at which performance suddenly falls off a cliff?
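The first two checks in this list can be automated by timing the same job at two dataset sizes and comparing the growth in run time with the growth in data. This is a minimal sketch: the `tolerance` of 1.5 is an arbitrary assumption, and the sorting workload is just a stand-in for the real job.

```python
import time


def time_run(work, size):
    """Wall-clock seconds taken by work(size)."""
    start = time.perf_counter()
    work(size)
    return time.perf_counter() - start


def growth_factor(t_small, t_large):
    """How many times longer the larger run took."""
    return t_large / t_small


def classify(factor, size_ratio=2.0, tolerance=1.5):
    """'linear' if time grew roughly like the data, otherwise 'superlinear'."""
    return "linear" if factor <= size_ratio * tolerance else "superlinear"


def stand_in_job(n):
    # Placeholder workload; replace with the real processing step.
    sorted(range(n, 0, -1))


t1 = time_run(stand_in_job, 200_000)
t2 = time_run(stand_in_job, 400_000)
print("growth factor for 2x data:", growth_factor(t1, t2))
```

Repeating the comparison over a range of sizes also helps locate the point where performance "falls off a cliff".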
Surprisingly this is how most perf and scalability tests start.
You can clearly do the testing without criteria: you just define the tests and measure the results. I think your question is more along the lines of 'how can I establish test passing criteria without performance requirements'. Actually this is not at all uncommon. Many new projects have no clear criteria established. Informally it would be something like 'if it cannot do X per second we failed'. But once you have passed X per second (and you had better!), is X the 'pass' criterion? Usually not. What happens is that you establish a new baseline and your performance tests guard against regression: you compare your current numbers with the best you got, and decide if the new build is 'acceptable' as a build validation pass (usually orgs will settle here at something like 70-80% as acceptable, open perf bugs, and make sure that by ship time you get back to 90-95% or 100%+). So basically the performance tests themselves become their own requirement.
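The baseline-and-regression idea can be expressed as a small gate in a build pipeline. The thresholds below follow the rough percentages mentioned above (70% acceptable, 90% by ship time); the function name and the returned labels are illustrative assumptions, and the metric is taken to be "higher is better" (e.g. messages per second).

```python
def regression_gate(current, best, accept_ratio=0.70, ship_ratio=0.90):
    """Compare the current build's throughput with the best recorded baseline."""
    ratio = current / best
    if ratio >= ship_ratio:
        return "ship-ready"
    if ratio >= accept_ratio:
        return "acceptable, open perf bug"
    return "fail"


# Made-up numbers for illustration.
print(regression_gate(current=950, best=1000))
print(regression_gate(current=750, best=1000))
print(regression_gate(current=600, best=1000))
```

When a build beats the recorded best, that figure becomes the new baseline for later comparisons.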
Scalability is a bit more complicated, because there is no limit. The scope of your test should be to find out where the product breaks. Throw enough load at anything and eventually it will break. You need to know where that limit is and, very importantly, find out how your product breaks. Does it give a nice error message and revert, or does it spill its guts on the floor?
Define your own. Take the initiative and describe the performance goals yourself.
To answer any better, we'd have to know more about your project.
If there has been 'no performance requirement defined', then why are you even testing this?
If there is a performance requirement defined, but it is 'vague', can you indicate in what way it is vague, so that we can better help you?
Short of that, start from the 'vague' requirement, and pick a reasonable target that at least in your opinion meets or exceeds the vague requirement, then go back to the customer and get them to confirm that your clarification meets their requirements and ideally get formal sign-off on that.
Some definitions / assumptions:
Performance = how quickly the application responds to user input, e.g. web page load times
Scalability = how many peak concurrent users the application can handle.
Firstly, performance. Performance testing can be quite simple, such as measuring and recording page load times in a development environment and using techniques like application profiling to identify and fix bottlenecks.
Load. To execute a load test there are four key factors; you will need to get all of these in place to be successful.
1. Good usage models of how users will use your site and/or application. This can be easy if the application is already in use, but it can be extremely difficult if you are launching something new, e.g. a Facebook application.
If you can't get targets as requirements, do some research and make some educated assumptions, document and circulate them for feedback.
2. Tools. You need to have performance testing scripts and tools that can execute the scenarios defined in step 1, with the number of expected users from step 1. (This can be quite expensive.)
3. Environment. You will need a production-like environment that is isolated so your tests can produce reproducible results. (This can also be very expensive.)
4. Technical experts. Once the application and environment start breaking, you will need to be able to identify the faults and re-configure the environment and/or re-code the application once faults are found.
Generally most projects have a "performance testing" box that they need to tick because of some past failure; however, they never plan or budget to do it properly. I normally recommend either budgeting for scalability testing and doing it properly, or saving your money and not doing it at all. Trying to half-do it on the cheap is a waste of time.
However any good developer should be able to do performance testing on their local machine and get some good benefits.
rely on tools (fxcop comes to mind)
rely on common sense
If you want to test performance and scalability with no requirements, then you should create your own requirements / specs that can be met within the timeline / deadline given to you. After defining those requirements, you should present them to your supervisor and check whether he/she agrees.
To test scalability (assuming you're testing a program/website):
Create lots of users and data and check if your system and database can handle it. MyISAM table type in MySQL can get the job done.
To test performance:
Optimize the code, test it over a slow internet connection, etc.
Short answer: Don't do it!
In order to get a (better) definition write a performance test concept you can discuss with the experts that should define the requirements.
Make assumptions for everything you don't know and document these assumptions explicitly. Assumptions comprise everything that may be relevant to your system's behaviour under load. Correct assumptions will be approved by the experts, incorrect ones will provoke reactions.
For all of those who have read Tom DeMarco's latest book (Adrenaline Junkies ...): this is the strawman pattern. Most people who are not willing to write a specification from scratch will not hesitate to give feedback on your document. Because you will have to guess several times when writing your version, you need to be prepared to be laughed at during review. But at least you will have better information.
The way I usually approach problems like this is just to get a real or simulated realistic workload and make the program go as fast as possible, within reason. Then if it can't handle the load I need to think about faster hardware, doing parts of the job in parallel, etc.
The performance tuning is in two parts.
Part 1 is the synchronous part, where I tune each "thread", under realistic workload, until it really has little room for improvement.
Part 2 is the asynchronous part, and it is hard work, but it needs to be done. For each "thread" I extract a time-stamped log file of when each message was sent, when each message was received, and when each received message was acted upon. I merge these logs into a common timeline of events. Then I go through all of it, or randomly selected parts, and trace the flow of messages between processes. I want to identify, for each message sequence, what its purpose is (i.e. is it truly necessary), and whether there are delays between the time of receipt and the time of processing, and if so, why.
I've found in this way I can "cut out the fat", and asynchronous processes can run very quickly.
Then if they don't meet requirements, whatever they are, it's not like the software can do any better. It will either take hardware or a fundamental redesign.
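The log-merging step from Part 2 above can be sketched with the standard library. The event layout (timestamp in microseconds, thread name, event kind, message id) is an assumption made for this example, and it presumes the per-thread timestamps come from the same clock or from clocks that are synchronised well enough.

```python
import heapq


def merge_logs(*logs):
    """Merge per-thread logs, each already sorted by time, into one timeline."""
    return list(heapq.merge(*logs, key=lambda event: event[0]))


def processing_delays(timeline):
    """Map message id -> microseconds between 'received' and 'acted' events."""
    received, delays = {}, {}
    for stamp, _thread, kind, msg_id in timeline:
        if kind == "received":
            received[msg_id] = stamp
        elif kind == "acted" and msg_id in received:
            delays[msg_id] = stamp - received[msg_id]
    return delays


# Made-up events: (timestamp_us, thread, kind, message_id)
sender = [(0, "A", "sent", 1), (10_000, "A", "sent", 2)]
receiver = [(2_000, "B", "received", 1), (3_000, "B", "acted", 1),
            (12_000, "B", "received", 2), (30_000, "B", "acted", 2)]

timeline = merge_logs(sender, receiver)
delays = processing_delays(timeline)
print(delays)
```

In this made-up data, message 2 waits 18 ms between receipt and processing, which is the kind of gap worth tracing back to its cause.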
Although no clear performance and scalability goals are defined, we can use the high-level description of the three components you mention to derive general performance/scalability goals.
Component 1: It seems like a network I/O bound component, so you can use any available network load simulator to generate various workloads that saturate the link. Scalability can be measured by varying the workload (10MB, 100MB, 1000MB link) and measuring the response time or, more precisely, the delay associated with receiving the raw data. You can also measure the working set of the Linux box to derive a realistic idea of your server requirements (how much extra memory is needed to receive X more packets of workload, etc.).
Component 2: This component has 2 parts, an I/O bound part (receiving data from Component 1) and a CPU bound part (assembling the packets). You can look at the problem as a whole; make sure to saturate your link when you want to measure the CPU bound part. If it is a multi-threaded component, you can look for ways to improve it if you don't get 100% CPU utilization. You can also measure the time required to assemble X messages and from this calculate the average wait time to process a message. This can be used later to derive the general performance characteristics of your system and to provide an SLA for your users (you are going to guarantee a response time within X milliseconds, for example).
Component 3: Completely I/O bound, and dependent on both your hard disk bandwidth and the back-end database server you use. However, you can measure how far you saturate disk I/O in order to optimize throughput, and how many I/O operations you need to read X MB of data, and improve around these parameters.
Hope that helps.
Thanks
